Ignore Warnings¶
import warnings
warnings.filterwarnings("ignore")
Importing required libraries¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max.rows',130)
pd.set_option('display.max.columns',130)
pd.set_option('float_format', '{:.2f}'.format)
Importing the dataset¶
df = pd.read_csv("application_data.csv")
# Checking few records from the dataframe
df.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.00 | 406597.50 | 24700.50 | 351000.00 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.02 | -9461 | -637 | -3648.00 | -2120 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1.00 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.08 | 0.26 | 0.14 | 0.02 | 0.04 | 0.97 | 0.62 | 0.01 | 0.00 | 0.07 | 0.08 | 0.12 | 0.04 | 0.02 | 0.02 | 0.00 | 0.00 | 0.03 | 0.04 | 0.97 | 0.63 | 0.01 | 0.00 | 0.07 | 0.08 | 0.12 | 0.04 | 0.02 | 0.02 | 0.00 | 0.00 | 0.03 | 0.04 | 0.97 | 0.62 | 0.01 | 0.00 | 0.07 | 0.08 | 0.12 | 0.04 | 0.02 | 0.02 | 0.00 | 0.00 | reg oper account | block of flats | 0.01 | Stone, brick | No | 2.00 | 2.00 | 2.00 | 2.00 | -1134.00 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.00 | 1293502.50 | 35698.50 | 1129500.00 | Family | State servant | Higher education | Married | House / apartment | 0.00 | -16765 | -1188 | -1186.00 | -291 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.00 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.31 | 0.62 | NaN | 0.10 | 0.05 | 0.99 | 0.80 | 0.06 | 0.08 | 0.03 | 0.29 | 0.33 | 0.01 | 0.08 | 0.05 | 0.00 | 0.01 | 0.09 | 0.05 | 0.99 | 0.80 | 0.05 | 0.08 | 0.03 | 0.29 | 0.33 | 0.01 | 0.08 | 0.06 | 0.00 | 0.00 | 0.10 | 0.05 | 0.99 | 0.80 | 0.06 | 0.08 | 0.03 | 0.29 | 0.33 | 0.01 | 0.08 | 0.06 | 0.00 | 0.01 | reg oper account | block of flats | 0.07 | Block | No | 1.00 | 0.00 | 1.00 | 0.00 | -828.00 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.00 | 135000.00 | 6750.00 | 135000.00 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.01 | -19046 | -225 | -4260.00 | -2531 | 26.00 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1.00 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | NaN | 0.56 | 0.73 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00 | 0.00 | 0.00 | 0.00 | -815.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.00 | 312682.50 | 29686.50 | 297000.00 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.01 | -19005 | -3039 | -9833.00 | -2437 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.00 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | NaN | 0.65 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.00 | 0.00 | 2.00 | 0.00 | -617.00 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.00 | 513000.00 | 21865.50 | 513000.00 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.03 | -19932 | -3038 | -4311.00 | -3458 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1.00 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | NaN | 0.32 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00 | 0.00 | 0.00 | 0.00 | -1106.00 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
Check structure of the data¶
df.info(verbose = True)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Data columns (total 122 columns): # Column Dtype --- ------ ----- 0 SK_ID_CURR int64 1 TARGET int64 2 NAME_CONTRACT_TYPE object 3 CODE_GENDER object 4 FLAG_OWN_CAR object 5 FLAG_OWN_REALTY object 6 CNT_CHILDREN int64 7 AMT_INCOME_TOTAL float64 8 AMT_CREDIT float64 9 AMT_ANNUITY float64 10 AMT_GOODS_PRICE float64 11 NAME_TYPE_SUITE object 12 NAME_INCOME_TYPE object 13 NAME_EDUCATION_TYPE object 14 NAME_FAMILY_STATUS object 15 NAME_HOUSING_TYPE object 16 REGION_POPULATION_RELATIVE float64 17 DAYS_BIRTH int64 18 DAYS_EMPLOYED int64 19 DAYS_REGISTRATION float64 20 DAYS_ID_PUBLISH int64 21 OWN_CAR_AGE float64 22 FLAG_MOBIL int64 23 FLAG_EMP_PHONE int64 24 FLAG_WORK_PHONE int64 25 FLAG_CONT_MOBILE int64 26 FLAG_PHONE int64 27 FLAG_EMAIL int64 28 OCCUPATION_TYPE object 29 CNT_FAM_MEMBERS float64 30 REGION_RATING_CLIENT int64 31 REGION_RATING_CLIENT_W_CITY int64 32 WEEKDAY_APPR_PROCESS_START object 33 HOUR_APPR_PROCESS_START int64 34 REG_REGION_NOT_LIVE_REGION int64 35 REG_REGION_NOT_WORK_REGION int64 36 LIVE_REGION_NOT_WORK_REGION int64 37 REG_CITY_NOT_LIVE_CITY int64 38 REG_CITY_NOT_WORK_CITY int64 39 LIVE_CITY_NOT_WORK_CITY int64 40 ORGANIZATION_TYPE object 41 EXT_SOURCE_1 float64 42 EXT_SOURCE_2 float64 43 EXT_SOURCE_3 float64 44 APARTMENTS_AVG float64 45 BASEMENTAREA_AVG float64 46 YEARS_BEGINEXPLUATATION_AVG float64 47 YEARS_BUILD_AVG float64 48 COMMONAREA_AVG float64 49 ELEVATORS_AVG float64 50 ENTRANCES_AVG float64 51 FLOORSMAX_AVG float64 52 FLOORSMIN_AVG float64 53 LANDAREA_AVG float64 54 LIVINGAPARTMENTS_AVG float64 55 LIVINGAREA_AVG float64 56 NONLIVINGAPARTMENTS_AVG float64 57 NONLIVINGAREA_AVG float64 58 APARTMENTS_MODE float64 59 BASEMENTAREA_MODE float64 60 YEARS_BEGINEXPLUATATION_MODE float64 61 YEARS_BUILD_MODE float64 62 COMMONAREA_MODE float64 63 ELEVATORS_MODE float64 64 ENTRANCES_MODE float64 65 FLOORSMAX_MODE float64 66 FLOORSMIN_MODE float64 67 LANDAREA_MODE float64 68 LIVINGAPARTMENTS_MODE float64 69 LIVINGAREA_MODE float64 70 NONLIVINGAPARTMENTS_MODE float64 71 NONLIVINGAREA_MODE float64 72 APARTMENTS_MEDI float64 73 BASEMENTAREA_MEDI float64 74 YEARS_BEGINEXPLUATATION_MEDI float64 75 YEARS_BUILD_MEDI float64 76 COMMONAREA_MEDI float64 77 ELEVATORS_MEDI float64 78 ENTRANCES_MEDI float64 79 FLOORSMAX_MEDI float64 80 FLOORSMIN_MEDI float64 81 LANDAREA_MEDI float64 82 LIVINGAPARTMENTS_MEDI float64 83 LIVINGAREA_MEDI float64 84 NONLIVINGAPARTMENTS_MEDI float64 85 NONLIVINGAREA_MEDI float64 86 FONDKAPREMONT_MODE object 87 HOUSETYPE_MODE object 88 TOTALAREA_MODE float64 89 WALLSMATERIAL_MODE object 90 EMERGENCYSTATE_MODE object 91 OBS_30_CNT_SOCIAL_CIRCLE float64 92 DEF_30_CNT_SOCIAL_CIRCLE float64 93 OBS_60_CNT_SOCIAL_CIRCLE float64 94 DEF_60_CNT_SOCIAL_CIRCLE float64 95 DAYS_LAST_PHONE_CHANGE float64 96 FLAG_DOCUMENT_2 int64 97 FLAG_DOCUMENT_3 int64 98 FLAG_DOCUMENT_4 int64 99 FLAG_DOCUMENT_5 int64 100 FLAG_DOCUMENT_6 int64 101 FLAG_DOCUMENT_7 int64 102 FLAG_DOCUMENT_8 int64 103 FLAG_DOCUMENT_9 int64 104 FLAG_DOCUMENT_10 int64 105 FLAG_DOCUMENT_11 int64 106 FLAG_DOCUMENT_12 int64 107 FLAG_DOCUMENT_13 int64 108 FLAG_DOCUMENT_14 int64 109 FLAG_DOCUMENT_15 int64 110 FLAG_DOCUMENT_16 int64 111 FLAG_DOCUMENT_17 int64 112 FLAG_DOCUMENT_18 int64 113 FLAG_DOCUMENT_19 int64 114 FLAG_DOCUMENT_20 int64 115 FLAG_DOCUMENT_21 int64 116 AMT_REQ_CREDIT_BUREAU_HOUR float64 117 AMT_REQ_CREDIT_BUREAU_DAY float64 118 AMT_REQ_CREDIT_BUREAU_WEEK float64 119 AMT_REQ_CREDIT_BUREAU_MON float64 120 AMT_REQ_CREDIT_BUREAU_QRT float64 121 AMT_REQ_CREDIT_BUREAU_YEAR float64 dtypes: float64(65), int64(41), object(16) memory usage: 286.2+ MB
We do not see any columns with Nullable values
df.shape
(307511, 122)
There are ~307k rows and 122 columns
Get statistical summary for numerical variables¶
df.describe()
| SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | TOTALAREA_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307499.00 | 307233.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 104582.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307509.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 134133.00 | 306851.00 | 246546.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 159080.00 | 306490.00 | 306490.00 | 306490.00 | 306490.00 | 307510.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 |
| mean | 278180.52 | 0.08 | 0.42 | 168797.92 | 599026.00 | 27108.57 | 538396.21 | 0.02 | -16037.00 | 63815.05 | -4986.12 | -2994.20 | 12.06 | 1.00 | 0.82 | 0.20 | 1.00 | 0.28 | 0.06 | 2.15 | 2.05 | 2.03 | 12.06 | 0.02 | 0.05 | 0.04 | 0.08 | 0.23 | 0.18 | 0.50 | 0.51 | 0.51 | 0.12 | 0.09 | 0.98 | 0.75 | 0.04 | 0.08 | 0.15 | 0.23 | 0.23 | 0.07 | 0.10 | 0.11 | 0.01 | 0.03 | 0.11 | 0.09 | 0.98 | 0.76 | 0.04 | 0.07 | 0.15 | 0.22 | 0.23 | 0.06 | 0.11 | 0.11 | 0.01 | 0.03 | 0.12 | 0.09 | 0.98 | 0.76 | 0.04 | 0.08 | 0.15 | 0.23 | 0.23 | 0.07 | 0.10 | 0.11 | 0.01 | 0.03 | 0.10 | 1.42 | 0.14 | 1.41 | 0.10 | -962.86 | 0.00 | 0.71 | 0.00 | 0.02 | 0.09 | 0.00 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.01 | 0.01 | 0.03 | 0.27 | 0.27 | 1.90 |
| std | 102790.18 | 0.27 | 0.72 | 237123.15 | 402490.78 | 14493.74 | 369446.46 | 0.01 | 4363.99 | 141275.77 | 3522.89 | 1509.45 | 11.94 | 0.00 | 0.38 | 0.40 | 0.04 | 0.45 | 0.23 | 0.91 | 0.51 | 0.50 | 3.27 | 0.12 | 0.22 | 0.20 | 0.27 | 0.42 | 0.38 | 0.21 | 0.19 | 0.19 | 0.11 | 0.08 | 0.06 | 0.11 | 0.08 | 0.13 | 0.10 | 0.14 | 0.16 | 0.08 | 0.09 | 0.11 | 0.05 | 0.07 | 0.11 | 0.08 | 0.06 | 0.11 | 0.07 | 0.13 | 0.10 | 0.14 | 0.16 | 0.08 | 0.10 | 0.11 | 0.05 | 0.07 | 0.11 | 0.08 | 0.06 | 0.11 | 0.08 | 0.13 | 0.10 | 0.15 | 0.16 | 0.08 | 0.09 | 0.11 | 0.05 | 0.07 | 0.11 | 2.40 | 0.45 | 2.38 | 0.36 | 826.81 | 0.01 | 0.45 | 0.01 | 0.12 | 0.28 | 0.01 | 0.27 | 0.06 | 0.00 | 0.06 | 0.00 | 0.06 | 0.05 | 0.03 | 0.10 | 0.02 | 0.09 | 0.02 | 0.02 | 0.02 | 0.08 | 0.11 | 0.20 | 0.92 | 0.79 | 1.87 |
| min | 100002.00 | 0.00 | 0.00 | 25650.00 | 45000.00 | 1615.50 | 40500.00 | 0.00 | -25229.00 | -17912.00 | -24672.00 | -7197.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | -4292.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 25% | 189145.50 | 0.00 | 0.00 | 112500.00 | 270000.00 | 16524.00 | 238500.00 | 0.01 | -19682.00 | -2760.00 | -7479.50 | -4299.00 | 5.00 | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 2.00 | 2.00 | 2.00 | 10.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.39 | 0.37 | 0.06 | 0.04 | 0.98 | 0.69 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.05 | 0.00 | 0.00 | 0.05 | 0.04 | 0.98 | 0.70 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.04 | 0.00 | 0.00 | 0.06 | 0.04 | 0.98 | 0.69 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.05 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | -1570.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 50% | 278202.00 | 0.00 | 0.00 | 147150.00 | 513531.00 | 24903.00 | 450000.00 | 0.02 | -15750.00 | -1213.00 | -4504.00 | -3254.00 | 9.00 | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 2.00 | 2.00 | 2.00 | 12.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.51 | 0.57 | 0.54 | 0.09 | 0.08 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.08 | 0.07 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.09 | 0.08 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | -757.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| 75% | 367142.50 | 0.00 | 1.00 | 202500.00 | 808650.00 | 34596.00 | 679500.00 | 0.03 | -12413.00 | -289.00 | -2010.00 | -1720.00 | 15.00 | 1.00 | 1.00 | 0.00 | 1.00 | 1.00 | 0.00 | 3.00 | 2.00 | 2.00 | 14.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.68 | 0.66 | 0.67 | 0.15 | 0.11 | 0.99 | 0.82 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.09 | 0.12 | 0.13 | 0.00 | 0.03 | 0.14 | 0.11 | 0.99 | 0.82 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.08 | 0.13 | 0.13 | 0.00 | 0.02 | 0.15 | 0.11 | 0.99 | 0.83 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.09 | 0.12 | 0.13 | 0.00 | 0.03 | 0.13 | 2.00 | 0.00 | 2.00 | 0.00 | -274.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 |
| max | 456255.00 | 1.00 | 19.00 | 117000000.00 | 4050000.00 | 258025.50 | 4050000.00 | 0.07 | -7489.00 | 365243.00 | 0.00 | 0.00 | 91.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 20.00 | 3.00 | 3.00 | 23.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.96 | 0.85 | 0.90 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 348.00 | 34.00 | 344.00 | 24.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 4.00 | 9.00 | 8.00 | 27.00 | 261.00 | 25.00 |
Analyzing categorical variables¶
df.select_dtypes(include = "object").columns
Index(['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE',
'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE'],
dtype='object')
# Checking number of categorical variables
len(df.select_dtypes(include = "object").columns)
16
There are 16 categorical variables
Analyzing numerical variables¶
df.select_dtypes(include=["int64","float64"]).columns
Index(['SK_ID_CURR', 'TARGET', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
...
'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR',
'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
'AMT_REQ_CREDIT_BUREAU_YEAR'],
dtype='object', length=106)
# Checking number of categorical variables
len(df.select_dtypes(include=["int64","float64"]).columns)
106
There are 106 numerical variables
df.select_dtypes(include=["int64","float64"])
| SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | TOTALAREA_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | 0 | 202500.00 | 406597.50 | 24700.50 | 351000.00 | 0.02 | -9461 | -637 | -3648.00 | -2120 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | 1.00 | 2 | 2 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0.08 | 0.26 | 0.14 | 0.02 | 0.04 | 0.97 | 0.62 | 0.01 | 0.00 | 0.07 | 0.08 | 0.12 | 0.04 | 0.02 | 0.02 | 0.00 | 0.00 | 0.03 | 0.04 | 0.97 | 0.63 | 0.01 | 0.00 | 0.07 | 0.08 | 0.12 | 0.04 | 0.02 | 0.02 | 0.00 | 0.00 | 0.03 | 0.04 | 0.97 | 0.62 | 0.01 | 0.00 | 0.07 | 0.08 | 0.12 | 0.04 | 0.02 | 0.02 | 0.00 | 0.00 | 0.01 | 2.00 | 2.00 | 2.00 | 2.00 | -1134.00 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| 1 | 100003 | 0 | 0 | 270000.00 | 1293502.50 | 35698.50 | 1129500.00 | 0.00 | -16765 | -1188 | -1186.00 | -291 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | 2.00 | 1 | 1 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0.31 | 0.62 | NaN | 0.10 | 0.05 | 0.99 | 0.80 | 0.06 | 0.08 | 0.03 | 0.29 | 0.33 | 0.01 | 0.08 | 0.05 | 0.00 | 0.01 | 0.09 | 0.05 | 0.99 | 0.80 | 0.05 | 0.08 | 0.03 | 0.29 | 0.33 | 0.01 | 0.08 | 0.06 | 0.00 | 0.00 | 0.10 | 0.05 | 0.99 | 0.80 | 0.06 | 0.08 | 0.03 | 0.29 | 0.33 | 0.01 | 0.08 | 0.06 | 0.00 | 0.01 | 0.07 | 1.00 | 0.00 | 1.00 | 0.00 | -828.00 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 2 | 100004 | 0 | 0 | 67500.00 | 135000.00 | 6750.00 | 135000.00 | 0.01 | -19046 | -225 | -4260.00 | -2531 | 26.00 | 1 | 1 | 1 | 1 | 1 | 0 | 1.00 | 2 | 2 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | 0.56 | 0.73 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00 | 0.00 | 0.00 | 0.00 | -815.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 3 | 100006 | 0 | 0 | 135000.00 | 312682.50 | 29686.50 | 297000.00 | 0.01 | -19005 | -3039 | -9833.00 | -2437 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | 2.00 | 2 | 2 | 17 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | 0.65 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.00 | 0.00 | 2.00 | 0.00 | -617.00 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | 0 | 121500.00 | 513000.00 | 21865.50 | 513000.00 | 0.03 | -19932 | -3038 | -4311.00 | -3458 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | 1.00 | 2 | 2 | 11 | 0 | 0 | 0 | 0 | 1 | 1 | NaN | 0.32 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00 | 0.00 | 0.00 | 0.00 | -1106.00 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307506 | 456251 | 0 | 0 | 157500.00 | 254700.00 | 27558.00 | 225000.00 | 0.03 | -9327 | -236 | -8456.00 | -1982 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | 1.00 | 1 | 1 | 15 | 0 | 0 | 0 | 0 | 0 | 0 | 0.15 | 0.68 | NaN | 0.20 | 0.09 | 0.99 | 0.83 | 0.02 | 0.22 | 0.10 | 0.60 | 0.27 | 0.06 | 0.15 | 0.20 | 0.08 | 0.11 | 0.10 | 0.02 | 0.98 | 0.71 | 0.02 | 0.08 | 0.03 | 0.46 | 0.04 | 0.01 | 0.09 | 0.09 | 0.00 | 0.01 | 0.20 | 0.09 | 0.99 | 0.83 | 0.02 | 0.22 | 0.10 | 0.60 | 0.27 | 0.06 | 0.15 | 0.20 | 0.08 | 0.11 | 0.29 | 0.00 | 0.00 | 0.00 | 0.00 | -273.00 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 307507 | 456252 | 0 | 0 | 72000.00 | 269550.00 | 12001.50 | 225000.00 | 0.03 | -20775 | 365243 | -4388.00 | -4090 | NaN | 1 | 0 | 0 | 1 | 1 | 0 | 1.00 | 2 | 2 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | 0.12 | NaN | 0.02 | 0.04 | 0.97 | 0.63 | 0.00 | 0.00 | 0.10 | 0.08 | 0.12 | 0.06 | 0.02 | 0.03 | 0.00 | 0.00 | 0.03 | 0.05 | 0.97 | 0.64 | 0.00 | 0.00 | 0.10 | 0.08 | 0.12 | 0.06 | 0.02 | 0.03 | 0.00 | 0.00 | 0.03 | 0.04 | 0.97 | 0.63 | 0.00 | 0.00 | 0.10 | 0.08 | 0.12 | 0.06 | 0.02 | 0.03 | 0.00 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 307508 | 456253 | 0 | 0 | 153000.00 | 677664.00 | 29979.00 | 585000.00 | 0.01 | -14966 | -7921 | -6737.00 | -5150 | NaN | 1 | 1 | 0 | 1 | 0 | 1 | 1.00 | 3 | 3 | 9 | 0 | 0 | 0 | 0 | 1 | 1 | 0.74 | 0.54 | 0.22 | 0.10 | 0.09 | 0.98 | 0.75 | 0.01 | 0.00 | 0.21 | 0.17 | 0.21 | NaN | 0.08 | 0.93 | 0.00 | 0.00 | 0.10 | 0.09 | 0.98 | 0.76 | 0.01 | 0.00 | 0.21 | 0.17 | 0.21 | NaN | 0.09 | 0.97 | 0.00 | 0.00 | 0.10 | 0.09 | 0.98 | 0.75 | 0.01 | 0.00 | 0.21 | 0.17 | 0.21 | NaN | 0.09 | 0.94 | 0.00 | 0.00 | 0.80 | 6.00 | 0.00 | 6.00 | 0.00 | -1909.00 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.00 | 0.00 | 0.00 | 1.00 | 0.00 | 1.00 |
| 307509 | 456254 | 1 | 0 | 171000.00 | 370107.00 | 20205.00 | 319500.00 | 0.01 | -11961 | -4786 | -2562.00 | -931 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | 2.00 | 2 | 2 | 9 | 0 | 0 | 0 | 1 | 1 | 0 | NaN | 0.51 | 0.66 | 0.01 | NaN | 0.98 | NaN | NaN | NaN | 0.07 | 0.04 | NaN | NaN | NaN | 0.01 | NaN | NaN | 0.01 | NaN | 0.98 | NaN | NaN | NaN | 0.07 | 0.04 | NaN | NaN | NaN | 0.01 | NaN | NaN | 0.01 | NaN | 0.98 | NaN | NaN | NaN | 0.07 | 0.04 | NaN | NaN | NaN | 0.01 | NaN | NaN | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | -322.00 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 307510 | 456255 | 0 | 0 | 157500.00 | 675000.00 | 49117.50 | 675000.00 | 0.05 | -16856 | -1262 | -5128.00 | -410 | NaN | 1 | 1 | 1 | 1 | 1 | 0 | 2.00 | 1 | 1 | 20 | 0 | 0 | 0 | 0 | 1 | 1 | 0.73 | 0.71 | 0.11 | 0.07 | 0.05 | 0.99 | NaN | 0.02 | 0.08 | 0.07 | 0.38 | NaN | NaN | NaN | 0.08 | NaN | 0.00 | 0.08 | 0.05 | 0.99 | NaN | 0.02 | 0.08 | 0.07 | 0.38 | NaN | NaN | NaN | 0.08 | NaN | 0.00 | 0.07 | 0.05 | 0.99 | NaN | 0.02 | 0.08 | 0.07 | 0.38 | NaN | NaN | NaN | 0.08 | NaN | 0.00 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | -787.00 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 2.00 | 0.00 | 1.00 |
307511 rows × 106 columns
Dealing with incorrect data types¶
Check if we have any column with incorrect data type
df.dtypes
SK_ID_CURR int64 TARGET int64 NAME_CONTRACT_TYPE object CODE_GENDER object FLAG_OWN_CAR object FLAG_OWN_REALTY object CNT_CHILDREN int64 AMT_INCOME_TOTAL float64 AMT_CREDIT float64 AMT_ANNUITY float64 AMT_GOODS_PRICE float64 NAME_TYPE_SUITE object NAME_INCOME_TYPE object NAME_EDUCATION_TYPE object NAME_FAMILY_STATUS object NAME_HOUSING_TYPE object REGION_POPULATION_RELATIVE float64 DAYS_BIRTH int64 DAYS_EMPLOYED int64 DAYS_REGISTRATION float64 DAYS_ID_PUBLISH int64 OWN_CAR_AGE float64 FLAG_MOBIL int64 FLAG_EMP_PHONE int64 FLAG_WORK_PHONE int64 FLAG_CONT_MOBILE int64 FLAG_PHONE int64 FLAG_EMAIL int64 OCCUPATION_TYPE object CNT_FAM_MEMBERS float64 REGION_RATING_CLIENT int64 REGION_RATING_CLIENT_W_CITY int64 WEEKDAY_APPR_PROCESS_START object HOUR_APPR_PROCESS_START int64 REG_REGION_NOT_LIVE_REGION int64 REG_REGION_NOT_WORK_REGION int64 LIVE_REGION_NOT_WORK_REGION int64 REG_CITY_NOT_LIVE_CITY int64 REG_CITY_NOT_WORK_CITY int64 LIVE_CITY_NOT_WORK_CITY int64 ORGANIZATION_TYPE object EXT_SOURCE_1 float64 EXT_SOURCE_2 float64 EXT_SOURCE_3 float64 APARTMENTS_AVG float64 BASEMENTAREA_AVG float64 YEARS_BEGINEXPLUATATION_AVG float64 YEARS_BUILD_AVG float64 COMMONAREA_AVG float64 ELEVATORS_AVG float64 ENTRANCES_AVG float64 FLOORSMAX_AVG float64 FLOORSMIN_AVG float64 LANDAREA_AVG float64 LIVINGAPARTMENTS_AVG float64 LIVINGAREA_AVG float64 NONLIVINGAPARTMENTS_AVG float64 NONLIVINGAREA_AVG float64 APARTMENTS_MODE float64 BASEMENTAREA_MODE float64 YEARS_BEGINEXPLUATATION_MODE float64 YEARS_BUILD_MODE float64 COMMONAREA_MODE float64 ELEVATORS_MODE float64 ENTRANCES_MODE float64 FLOORSMAX_MODE float64 FLOORSMIN_MODE float64 LANDAREA_MODE float64 LIVINGAPARTMENTS_MODE float64 LIVINGAREA_MODE float64 NONLIVINGAPARTMENTS_MODE float64 NONLIVINGAREA_MODE float64 APARTMENTS_MEDI float64 BASEMENTAREA_MEDI float64 YEARS_BEGINEXPLUATATION_MEDI float64 YEARS_BUILD_MEDI float64 COMMONAREA_MEDI float64 ELEVATORS_MEDI float64 ENTRANCES_MEDI float64 FLOORSMAX_MEDI float64 FLOORSMIN_MEDI float64 LANDAREA_MEDI float64 LIVINGAPARTMENTS_MEDI float64 LIVINGAREA_MEDI float64 NONLIVINGAPARTMENTS_MEDI float64 NONLIVINGAREA_MEDI float64 FONDKAPREMONT_MODE object HOUSETYPE_MODE object TOTALAREA_MODE float64 WALLSMATERIAL_MODE object EMERGENCYSTATE_MODE object OBS_30_CNT_SOCIAL_CIRCLE float64 DEF_30_CNT_SOCIAL_CIRCLE float64 OBS_60_CNT_SOCIAL_CIRCLE float64 DEF_60_CNT_SOCIAL_CIRCLE float64 DAYS_LAST_PHONE_CHANGE float64 FLAG_DOCUMENT_2 int64 FLAG_DOCUMENT_3 int64 FLAG_DOCUMENT_4 int64 FLAG_DOCUMENT_5 int64 FLAG_DOCUMENT_6 int64 FLAG_DOCUMENT_7 int64 FLAG_DOCUMENT_8 int64 FLAG_DOCUMENT_9 int64 FLAG_DOCUMENT_10 int64 FLAG_DOCUMENT_11 int64 FLAG_DOCUMENT_12 int64 FLAG_DOCUMENT_13 int64 FLAG_DOCUMENT_14 int64 FLAG_DOCUMENT_15 int64 FLAG_DOCUMENT_16 int64 FLAG_DOCUMENT_17 int64 FLAG_DOCUMENT_18 int64 FLAG_DOCUMENT_19 int64 FLAG_DOCUMENT_20 int64 FLAG_DOCUMENT_21 int64 AMT_REQ_CREDIT_BUREAU_HOUR float64 AMT_REQ_CREDIT_BUREAU_DAY float64 AMT_REQ_CREDIT_BUREAU_WEEK float64 AMT_REQ_CREDIT_BUREAU_MON float64 AMT_REQ_CREDIT_BUREAU_QRT float64 AMT_REQ_CREDIT_BUREAU_YEAR float64 dtype: object
df.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.00 | 406597.50 | 24700.50 | 351000.00 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.02 | -9461 | -637 | -3648.00 | -2120 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1.00 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.08 | 0.26 | 0.14 | 0.02 | 0.04 | 0.97 | 0.62 | 0.01 | 0.00 | 0.07 | 0.08 | 0.12 | 0.04 | 0.02 | 0.02 | 0.00 | 0.00 | 0.03 | 0.04 | 0.97 | 0.63 | 0.01 | 0.00 | 0.07 | 0.08 | 0.12 | 0.04 | 0.02 | 0.02 | 0.00 | 0.00 | 0.03 | 0.04 | 0.97 | 0.62 | 0.01 | 0.00 | 0.07 | 0.08 | 0.12 | 0.04 | 0.02 | 0.02 | 0.00 | 0.00 | reg oper account | block of flats | 0.01 | Stone, brick | No | 2.00 | 2.00 | 2.00 | 2.00 | -1134.00 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.00 | 1293502.50 | 35698.50 | 1129500.00 | Family | State servant | Higher education | Married | House / apartment | 0.00 | -16765 | -1188 | -1186.00 | -291 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.00 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.31 | 0.62 | NaN | 0.10 | 0.05 | 0.99 | 0.80 | 0.06 | 0.08 | 0.03 | 0.29 | 0.33 | 0.01 | 0.08 | 0.05 | 0.00 | 0.01 | 0.09 | 0.05 | 0.99 | 0.80 | 0.05 | 0.08 | 0.03 | 0.29 | 0.33 | 0.01 | 0.08 | 0.06 | 0.00 | 0.00 | 0.10 | 0.05 | 0.99 | 0.80 | 0.06 | 0.08 | 0.03 | 0.29 | 0.33 | 0.01 | 0.08 | 0.06 | 0.00 | 0.01 | reg oper account | block of flats | 0.07 | Block | No | 1.00 | 0.00 | 1.00 | 0.00 | -828.00 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.00 | 135000.00 | 6750.00 | 135000.00 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.01 | -19046 | -225 | -4260.00 | -2531 | 26.00 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1.00 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | NaN | 0.56 | 0.73 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00 | 0.00 | 0.00 | 0.00 | -815.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.00 | 312682.50 | 29686.50 | 297000.00 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.01 | -19005 | -3039 | -9833.00 | -2437 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.00 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | NaN | 0.65 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.00 | 0.00 | 2.00 | 0.00 | -617.00 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.00 | 513000.00 | 21865.50 | 513000.00 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.03 | -19932 | -3038 | -4311.00 | -3458 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1.00 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | NaN | 0.32 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00 | 0.00 | 0.00 | 0.00 | -1106.00 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
Looking at the data and their corresponding data types, we can conclude that No Data Type changes are required
Dealing with missing values¶
Check if we have any null values in the dataset
df.isnull().values.any()
True
Get Total number of null values in the dataset
df.isnull().values.sum()
9152465
Getting the list of column(s) which have null values
df.columns[df.isnull().any()]
Index(['AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE', 'OWN_CAR_AGE',
'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2',
'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG',
'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG',
'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG',
'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG',
'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE',
'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE',
'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE',
'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE',
'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE',
'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI',
'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI',
'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI',
'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI',
'NONLIVINGAREA_MEDI', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE',
'TOTALAREA_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE',
'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE',
'DAYS_LAST_PHONE_CHANGE', 'AMT_REQ_CREDIT_BUREAU_HOUR',
'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
'AMT_REQ_CREDIT_BUREAU_YEAR'],
dtype='object')
len(df.columns[df.isnull().any()])
67
There are totally 67 columns having one or more NULL values in the data
Computing count and percentage of missing values¶
null_count = df.isnull().sum()
null_percentage = round((df.isnull().sum()/df.shape[0])*100, 2)
null_df = pd.DataFrame({'column_name' : df.columns,'null_count' : null_count,'null_percentage': null_percentage})
null_df.reset_index(drop = True, inplace = True)
null_df.sort_values(by = 'null_percentage', ascending = False)
| column_name | null_count | null_percentage | |
|---|---|---|---|
| 76 | COMMONAREA_MEDI | 214865 | 69.87 |
| 48 | COMMONAREA_AVG | 214865 | 69.87 |
| 62 | COMMONAREA_MODE | 214865 | 69.87 |
| 70 | NONLIVINGAPARTMENTS_MODE | 213514 | 69.43 |
| 56 | NONLIVINGAPARTMENTS_AVG | 213514 | 69.43 |
| 84 | NONLIVINGAPARTMENTS_MEDI | 213514 | 69.43 |
| 86 | FONDKAPREMONT_MODE | 210295 | 68.39 |
| 68 | LIVINGAPARTMENTS_MODE | 210199 | 68.35 |
| 54 | LIVINGAPARTMENTS_AVG | 210199 | 68.35 |
| 82 | LIVINGAPARTMENTS_MEDI | 210199 | 68.35 |
| 52 | FLOORSMIN_AVG | 208642 | 67.85 |
| 66 | FLOORSMIN_MODE | 208642 | 67.85 |
| 80 | FLOORSMIN_MEDI | 208642 | 67.85 |
| 75 | YEARS_BUILD_MEDI | 204488 | 66.50 |
| 61 | YEARS_BUILD_MODE | 204488 | 66.50 |
| 47 | YEARS_BUILD_AVG | 204488 | 66.50 |
| 21 | OWN_CAR_AGE | 202929 | 65.99 |
| 81 | LANDAREA_MEDI | 182590 | 59.38 |
| 67 | LANDAREA_MODE | 182590 | 59.38 |
| 53 | LANDAREA_AVG | 182590 | 59.38 |
| 73 | BASEMENTAREA_MEDI | 179943 | 58.52 |
| 45 | BASEMENTAREA_AVG | 179943 | 58.52 |
| 59 | BASEMENTAREA_MODE | 179943 | 58.52 |
| 41 | EXT_SOURCE_1 | 173378 | 56.38 |
| 71 | NONLIVINGAREA_MODE | 169682 | 55.18 |
| 57 | NONLIVINGAREA_AVG | 169682 | 55.18 |
| 85 | NONLIVINGAREA_MEDI | 169682 | 55.18 |
| 77 | ELEVATORS_MEDI | 163891 | 53.30 |
| 49 | ELEVATORS_AVG | 163891 | 53.30 |
| 63 | ELEVATORS_MODE | 163891 | 53.30 |
| 89 | WALLSMATERIAL_MODE | 156341 | 50.84 |
| 72 | APARTMENTS_MEDI | 156061 | 50.75 |
| 44 | APARTMENTS_AVG | 156061 | 50.75 |
| 58 | APARTMENTS_MODE | 156061 | 50.75 |
| 78 | ENTRANCES_MEDI | 154828 | 50.35 |
| 50 | ENTRANCES_AVG | 154828 | 50.35 |
| 64 | ENTRANCES_MODE | 154828 | 50.35 |
| 55 | LIVINGAREA_AVG | 154350 | 50.19 |
| 69 | LIVINGAREA_MODE | 154350 | 50.19 |
| 83 | LIVINGAREA_MEDI | 154350 | 50.19 |
| 87 | HOUSETYPE_MODE | 154297 | 50.18 |
| 65 | FLOORSMAX_MODE | 153020 | 49.76 |
| 79 | FLOORSMAX_MEDI | 153020 | 49.76 |
| 51 | FLOORSMAX_AVG | 153020 | 49.76 |
| 60 | YEARS_BEGINEXPLUATATION_MODE | 150007 | 48.78 |
| 74 | YEARS_BEGINEXPLUATATION_MEDI | 150007 | 48.78 |
| 46 | YEARS_BEGINEXPLUATATION_AVG | 150007 | 48.78 |
| 88 | TOTALAREA_MODE | 148431 | 48.27 |
| 90 | EMERGENCYSTATE_MODE | 145755 | 47.40 |
| 28 | OCCUPATION_TYPE | 96391 | 31.35 |
| 43 | EXT_SOURCE_3 | 60965 | 19.83 |
| 116 | AMT_REQ_CREDIT_BUREAU_HOUR | 41519 | 13.50 |
| 117 | AMT_REQ_CREDIT_BUREAU_DAY | 41519 | 13.50 |
| 118 | AMT_REQ_CREDIT_BUREAU_WEEK | 41519 | 13.50 |
| 119 | AMT_REQ_CREDIT_BUREAU_MON | 41519 | 13.50 |
| 120 | AMT_REQ_CREDIT_BUREAU_QRT | 41519 | 13.50 |
| 121 | AMT_REQ_CREDIT_BUREAU_YEAR | 41519 | 13.50 |
| 11 | NAME_TYPE_SUITE | 1292 | 0.42 |
| 92 | DEF_30_CNT_SOCIAL_CIRCLE | 1021 | 0.33 |
| 91 | OBS_30_CNT_SOCIAL_CIRCLE | 1021 | 0.33 |
| 93 | OBS_60_CNT_SOCIAL_CIRCLE | 1021 | 0.33 |
| 94 | DEF_60_CNT_SOCIAL_CIRCLE | 1021 | 0.33 |
| 42 | EXT_SOURCE_2 | 660 | 0.21 |
| 10 | AMT_GOODS_PRICE | 278 | 0.09 |
| 6 | CNT_CHILDREN | 0 | 0.00 |
| 102 | FLAG_DOCUMENT_8 | 0 | 0.00 |
| 2 | NAME_CONTRACT_TYPE | 0 | 0.00 |
| 3 | CODE_GENDER | 0 | 0.00 |
| 4 | FLAG_OWN_CAR | 0 | 0.00 |
| 95 | DAYS_LAST_PHONE_CHANGE | 1 | 0.00 |
| 96 | FLAG_DOCUMENT_2 | 0 | 0.00 |
| 97 | FLAG_DOCUMENT_3 | 0 | 0.00 |
| 98 | FLAG_DOCUMENT_4 | 0 | 0.00 |
| 99 | FLAG_DOCUMENT_5 | 0 | 0.00 |
| 100 | FLAG_DOCUMENT_6 | 0 | 0.00 |
| 101 | FLAG_DOCUMENT_7 | 0 | 0.00 |
| 103 | FLAG_DOCUMENT_9 | 0 | 0.00 |
| 115 | FLAG_DOCUMENT_21 | 0 | 0.00 |
| 104 | FLAG_DOCUMENT_10 | 0 | 0.00 |
| 105 | FLAG_DOCUMENT_11 | 0 | 0.00 |
| 5 | FLAG_OWN_REALTY | 0 | 0.00 |
| 107 | FLAG_DOCUMENT_13 | 0 | 0.00 |
| 108 | FLAG_DOCUMENT_14 | 0 | 0.00 |
| 109 | FLAG_DOCUMENT_15 | 0 | 0.00 |
| 110 | FLAG_DOCUMENT_16 | 0 | 0.00 |
| 111 | FLAG_DOCUMENT_17 | 0 | 0.00 |
| 112 | FLAG_DOCUMENT_18 | 0 | 0.00 |
| 113 | FLAG_DOCUMENT_19 | 0 | 0.00 |
| 114 | FLAG_DOCUMENT_20 | 0 | 0.00 |
| 106 | FLAG_DOCUMENT_12 | 0 | 0.00 |
| 8 | AMT_CREDIT | 0 | 0.00 |
| 7 | AMT_INCOME_TOTAL | 0 | 0.00 |
| 26 | FLAG_PHONE | 0 | 0.00 |
| 39 | LIVE_CITY_NOT_WORK_CITY | 0 | 0.00 |
| 38 | REG_CITY_NOT_WORK_CITY | 0 | 0.00 |
| 1 | TARGET | 0 | 0.00 |
| 37 | REG_CITY_NOT_LIVE_CITY | 0 | 0.00 |
| 36 | LIVE_REGION_NOT_WORK_REGION | 0 | 0.00 |
| 35 | REG_REGION_NOT_WORK_REGION | 0 | 0.00 |
| 34 | REG_REGION_NOT_LIVE_REGION | 0 | 0.00 |
| 33 | HOUR_APPR_PROCESS_START | 0 | 0.00 |
| 32 | WEEKDAY_APPR_PROCESS_START | 0 | 0.00 |
| 31 | REGION_RATING_CLIENT_W_CITY | 0 | 0.00 |
| 30 | REGION_RATING_CLIENT | 0 | 0.00 |
| 29 | CNT_FAM_MEMBERS | 2 | 0.00 |
| 27 | FLAG_EMAIL | 0 | 0.00 |
| 25 | FLAG_CONT_MOBILE | 0 | 0.00 |
| 40 | ORGANIZATION_TYPE | 0 | 0.00 |
| 24 | FLAG_WORK_PHONE | 0 | 0.00 |
| 23 | FLAG_EMP_PHONE | 0 | 0.00 |
| 22 | FLAG_MOBIL | 0 | 0.00 |
| 20 | DAYS_ID_PUBLISH | 0 | 0.00 |
| 19 | DAYS_REGISTRATION | 0 | 0.00 |
| 18 | DAYS_EMPLOYED | 0 | 0.00 |
| 17 | DAYS_BIRTH | 0 | 0.00 |
| 16 | REGION_POPULATION_RELATIVE | 0 | 0.00 |
| 15 | NAME_HOUSING_TYPE | 0 | 0.00 |
| 14 | NAME_FAMILY_STATUS | 0 | 0.00 |
| 13 | NAME_EDUCATION_TYPE | 0 | 0.00 |
| 12 | NAME_INCOME_TYPE | 0 | 0.00 |
| 9 | AMT_ANNUITY | 12 | 0.00 |
| 0 | SK_ID_CURR | 0 | 0.00 |
Removing columns with NULL values > 40%¶
Getting list of columns with NULL values > 40% into a list. We will be removing these columns from the dataframe as there are too many missing values.
columns_to_be_deleted = null_df[null_df['null_percentage'] > 40].column_name.to_list()
columns_to_be_deleted
['OWN_CAR_AGE', 'EXT_SOURCE_1', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'TOTALAREA_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE']
len(columns_to_be_deleted)
49
There are totally 49 columns to be removed. Deleting them from main dataframe df
df.drop(columns = columns_to_be_deleted, inplace = True)
Checking column count post removal. Only 73 columns should be left
df.shape
(307511, 73)
Checking columns with NULL values < 40%¶
Creating dataframe null_df_under40 with missing column percentages under 40%
null_df_under40 = null_df[null_df['null_percentage'] < 40]
null_df_under40.sort_values(by = 'null_percentage', ascending = False)
| column_name | null_count | null_percentage | |
|---|---|---|---|
| 28 | OCCUPATION_TYPE | 96391 | 31.35 |
| 43 | EXT_SOURCE_3 | 60965 | 19.83 |
| 121 | AMT_REQ_CREDIT_BUREAU_YEAR | 41519 | 13.50 |
| 120 | AMT_REQ_CREDIT_BUREAU_QRT | 41519 | 13.50 |
| 119 | AMT_REQ_CREDIT_BUREAU_MON | 41519 | 13.50 |
| 118 | AMT_REQ_CREDIT_BUREAU_WEEK | 41519 | 13.50 |
| 117 | AMT_REQ_CREDIT_BUREAU_DAY | 41519 | 13.50 |
| 116 | AMT_REQ_CREDIT_BUREAU_HOUR | 41519 | 13.50 |
| 11 | NAME_TYPE_SUITE | 1292 | 0.42 |
| 91 | OBS_30_CNT_SOCIAL_CIRCLE | 1021 | 0.33 |
| 92 | DEF_30_CNT_SOCIAL_CIRCLE | 1021 | 0.33 |
| 93 | OBS_60_CNT_SOCIAL_CIRCLE | 1021 | 0.33 |
| 94 | DEF_60_CNT_SOCIAL_CIRCLE | 1021 | 0.33 |
| 42 | EXT_SOURCE_2 | 660 | 0.21 |
| 10 | AMT_GOODS_PRICE | 278 | 0.09 |
| 101 | FLAG_DOCUMENT_7 | 0 | 0.00 |
| 95 | DAYS_LAST_PHONE_CHANGE | 1 | 0.00 |
| 96 | FLAG_DOCUMENT_2 | 0 | 0.00 |
| 97 | FLAG_DOCUMENT_3 | 0 | 0.00 |
| 98 | FLAG_DOCUMENT_4 | 0 | 0.00 |
| 99 | FLAG_DOCUMENT_5 | 0 | 0.00 |
| 100 | FLAG_DOCUMENT_6 | 0 | 0.00 |
| 105 | FLAG_DOCUMENT_11 | 0 | 0.00 |
| 102 | FLAG_DOCUMENT_8 | 0 | 0.00 |
| 103 | FLAG_DOCUMENT_9 | 0 | 0.00 |
| 104 | FLAG_DOCUMENT_10 | 0 | 0.00 |
| 39 | LIVE_CITY_NOT_WORK_CITY | 0 | 0.00 |
| 106 | FLAG_DOCUMENT_12 | 0 | 0.00 |
| 107 | FLAG_DOCUMENT_13 | 0 | 0.00 |
| 108 | FLAG_DOCUMENT_14 | 0 | 0.00 |
| 109 | FLAG_DOCUMENT_15 | 0 | 0.00 |
| 110 | FLAG_DOCUMENT_16 | 0 | 0.00 |
| 111 | FLAG_DOCUMENT_17 | 0 | 0.00 |
| 112 | FLAG_DOCUMENT_18 | 0 | 0.00 |
| 113 | FLAG_DOCUMENT_19 | 0 | 0.00 |
| 114 | FLAG_DOCUMENT_20 | 0 | 0.00 |
| 115 | FLAG_DOCUMENT_21 | 0 | 0.00 |
| 40 | ORGANIZATION_TYPE | 0 | 0.00 |
| 0 | SK_ID_CURR | 0 | 0.00 |
| 38 | REG_CITY_NOT_WORK_CITY | 0 | 0.00 |
| 9 | AMT_ANNUITY | 12 | 0.00 |
| 17 | DAYS_BIRTH | 0 | 0.00 |
| 16 | REGION_POPULATION_RELATIVE | 0 | 0.00 |
| 15 | NAME_HOUSING_TYPE | 0 | 0.00 |
| 14 | NAME_FAMILY_STATUS | 0 | 0.00 |
| 13 | NAME_EDUCATION_TYPE | 0 | 0.00 |
| 12 | NAME_INCOME_TYPE | 0 | 0.00 |
| 8 | AMT_CREDIT | 0 | 0.00 |
| 1 | TARGET | 0 | 0.00 |
| 7 | AMT_INCOME_TOTAL | 0 | 0.00 |
| 6 | CNT_CHILDREN | 0 | 0.00 |
| 5 | FLAG_OWN_REALTY | 0 | 0.00 |
| 4 | FLAG_OWN_CAR | 0 | 0.00 |
| 3 | CODE_GENDER | 0 | 0.00 |
| 2 | NAME_CONTRACT_TYPE | 0 | 0.00 |
| 18 | DAYS_EMPLOYED | 0 | 0.00 |
| 19 | DAYS_REGISTRATION | 0 | 0.00 |
| 20 | DAYS_ID_PUBLISH | 0 | 0.00 |
| 22 | FLAG_MOBIL | 0 | 0.00 |
| 23 | FLAG_EMP_PHONE | 0 | 0.00 |
| 24 | FLAG_WORK_PHONE | 0 | 0.00 |
| 25 | FLAG_CONT_MOBILE | 0 | 0.00 |
| 26 | FLAG_PHONE | 0 | 0.00 |
| 27 | FLAG_EMAIL | 0 | 0.00 |
| 29 | CNT_FAM_MEMBERS | 2 | 0.00 |
| 30 | REGION_RATING_CLIENT | 0 | 0.00 |
| 31 | REGION_RATING_CLIENT_W_CITY | 0 | 0.00 |
| 32 | WEEKDAY_APPR_PROCESS_START | 0 | 0.00 |
| 33 | HOUR_APPR_PROCESS_START | 0 | 0.00 |
| 34 | REG_REGION_NOT_LIVE_REGION | 0 | 0.00 |
| 35 | REG_REGION_NOT_WORK_REGION | 0 | 0.00 |
| 36 | LIVE_REGION_NOT_WORK_REGION | 0 | 0.00 |
| 37 | REG_CITY_NOT_LIVE_CITY | 0 | 0.00 |
Analysis of OCCUPATION_TYPE column¶
- nullable values = 31.35%
df['OCCUPATION_TYPE'].value_counts()
Laborers 55186 Sales staff 32102 Core staff 27570 Managers 21371 Drivers 18603 High skill tech staff 11380 Accountants 9813 Medicine staff 8537 Security staff 6721 Cooking staff 5946 Cleaning staff 4653 Private service staff 2652 Low-skill Laborers 2093 Waiters/barmen staff 1348 Secretaries 1305 Realty agents 751 HR staff 563 IT staff 526 Name: OCCUPATION_TYPE, dtype: int64
Impute NULL values with Unknown category
df['OCCUPATION_TYPE'].fillna(value = 'Unknown', inplace = True)
plt.figure(figsize = (10,5))
sns.countplot(data = df, x = "OCCUPATION_TYPE")
plt.xticks(rotation = 90)
plt.show()
Observations
- Looking at the plot,
Laborershas the highest number of loan applicants - For imputation, it would be better to leave the data as is (missing values being 31.35%) and not impute to min/min/mode/median as it may bias the data in later computations
Analysis of EXT_SOURCE_3 column¶
- nullable values = 19.83%
df.EXT_SOURCE_3.value_counts().head()
0.75 1460 0.71 1315 0.69 1276 0.67 1191 0.65 1154 Name: EXT_SOURCE_3, dtype: int64
sns.boxplot(df.EXT_SOURCE_3)
plt.show()
Getting percentile values for EXT_SOURCE_3
df.EXT_SOURCE_3.quantile(q = [0.25,0.5,0.75,1])
0.25 0.37 0.50 0.54 0.75 0.67 1.00 0.90 Name: EXT_SOURCE_3, dtype: float64
Most recurring value in EXT_SOURCE_3
df.EXT_SOURCE_3.mode()[0]
0.746300213050371
Checking the average value of EXT_SOURCE_3
df.EXT_SOURCE_3.mean()
0.5108529061800121
Observations
- Looking at the boxplot, median is 0.535276
- Most recurring value is 0.746300213050371
- Mean value is 0.5108529061800121
- Though mean and median are closer and could be used for imputation, since missing percentage value is higher (19.83%), it would be better to leave the data as it is and not perform imputations
Analysis of six columns with 13.5% missing values¶
AMT_REQ_CREDIT_BUREAU_YEARAMT_REQ_CREDIT_BUREAU_QRTAMT_REQ_CREDIT_BUREAU_MONAMT_REQ_CREDIT_BUREAU_WEEKAMT_REQ_CREDIT_BUREAU_DAYAMT_REQ_CREDIT_BUREAU_HOUR
nullable values = 13.50%
Looking at summary statistics for the columns
df[['AMT_REQ_CREDIT_BUREAU_YEAR',
'AMT_REQ_CREDIT_BUREAU_QRT',
'AMT_REQ_CREDIT_BUREAU_MON',
'AMT_REQ_CREDIT_BUREAU_WEEK',
'AMT_REQ_CREDIT_BUREAU_DAY',
'AMT_REQ_CREDIT_BUREAU_HOUR']].describe()
| AMT_REQ_CREDIT_BUREAU_YEAR | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_HOUR | |
|---|---|---|---|---|---|---|
| count | 265992.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 |
| mean | 1.90 | 0.27 | 0.27 | 0.03 | 0.01 | 0.01 |
| std | 1.87 | 0.79 | 0.92 | 0.20 | 0.11 | 0.08 |
| min | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 25% | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 50% | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 75% | 3.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| max | 25.00 | 261.00 | 27.00 | 8.00 | 9.00 | 4.00 |
Most recurring value for the columns
df[['AMT_REQ_CREDIT_BUREAU_YEAR',
'AMT_REQ_CREDIT_BUREAU_QRT',
'AMT_REQ_CREDIT_BUREAU_MON',
'AMT_REQ_CREDIT_BUREAU_WEEK',
'AMT_REQ_CREDIT_BUREAU_DAY',
'AMT_REQ_CREDIT_BUREAU_HOUR']].mode()
| AMT_REQ_CREDIT_BUREAU_YEAR | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_HOUR | |
|---|---|---|---|---|---|---|
| 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
Observations for the below columns
AMT_REQ_CREDIT_BUREAU_YEAR- Mean is 1.899974 and Median is 1.000000
- Mode is 0
- Don't impute as nullable values are higher (13.50%) and it might introduce bias in the data
AMT_REQ_CREDIT_BUREAU_QRT- Mean is 0.265474 and Median is 0
- Mode is 0
- Don't impute as nullable values are higher (13.50%) and it might introduce bias in the data
AMT_REQ_CREDIT_BUREAU_MON- Mean is 0.267395 and Median is 0
- Mode is 0
- Don't impute as nullable values are higher (13.50%) and it might introduce bias in the data
AMT_REQ_CREDIT_BUREAU_WEEK- Mean is 0.034362 and Median is 0
- Mode is 0
- Don't impute as nullable values are higher (13.50%) and it might introduce bias in the data
AMT_REQ_CREDIT_BUREAU_DAY- Mean is 0.007000 and Median is 0
- Mode is 0
- Don't impute as nullable values are higher (13.50%) and it might introduce bias in the data
AMT_REQ_CREDIT_BUREAU_HOUR- Mean is 0.006402 and Median is 0
- Mode is 0
- Don't impute as nullable values are higher (13.50%) and it might introduce bias in the data
Checking columns with NULL values > 0% and < 1%¶
Creating dataframe null_df_under1 with missing column percentages values > 0% and < 1%
null_df_under1 = null_df[(null_df['null_percentage'] > 0) & (null_df['null_percentage'] < 1)]
null_df_under1.sort_values(by = 'null_percentage', ascending = False)
| column_name | null_count | null_percentage | |
|---|---|---|---|
| 11 | NAME_TYPE_SUITE | 1292 | 0.42 |
| 91 | OBS_30_CNT_SOCIAL_CIRCLE | 1021 | 0.33 |
| 92 | DEF_30_CNT_SOCIAL_CIRCLE | 1021 | 0.33 |
| 93 | OBS_60_CNT_SOCIAL_CIRCLE | 1021 | 0.33 |
| 94 | DEF_60_CNT_SOCIAL_CIRCLE | 1021 | 0.33 |
| 42 | EXT_SOURCE_2 | 660 | 0.21 |
| 10 | AMT_GOODS_PRICE | 278 | 0.09 |
Analysis of NAME_TYPE_SUITE column¶
- nullable values = 0.42%
df['NAME_TYPE_SUITE'].value_counts()
Unaccompanied 248526 Family 40149 Spouse, partner 11370 Children 3267 Other_B 1770 Other_A 866 Group of people 271 Name: NAME_TYPE_SUITE, dtype: int64
plt.figure(figsize = (10,5))
sns.countplot(data = df, x = "NAME_TYPE_SUITE")
plt.xticks(rotation = 90)
plt.show()
Observations
- Looking at the plot,
Unaccompaniedcategory has the highest number of loan applicants. So, most of the loan applicants venture out alone for applying loan - We can go ahead and impute
Unaccompaniedin the dataframe
Analysis of OBS_30_CNT_SOCIAL_CIRCLE column¶
- nullable values = 0.33%
df.OBS_30_CNT_SOCIAL_CIRCLE.value_counts().head()
0.00 163910 1.00 48783 2.00 29808 3.00 20322 4.00 14143 Name: OBS_30_CNT_SOCIAL_CIRCLE, dtype: int64
sns.boxplot(df.OBS_30_CNT_SOCIAL_CIRCLE)
plt.show()
Getting percentile values for OBS_30_CNT_SOCIAL_CIRCLE
df.OBS_30_CNT_SOCIAL_CIRCLE.quantile(q = [0.25,0.5,0.75,1])
0.25 0.00 0.50 0.00 0.75 2.00 1.00 348.00 Name: OBS_30_CNT_SOCIAL_CIRCLE, dtype: float64
Most recurring value in OBS_30_CNT_SOCIAL_CIRCLE
df.OBS_30_CNT_SOCIAL_CIRCLE.mode()[0]
0.0
Checking the average value of OBS_30_CNT_SOCIAL_CIRCLE
df.OBS_30_CNT_SOCIAL_CIRCLE.mean()
1.4222454239942575
Observations
- Looking at the boxplot, median is 0.0
- Most recurring value is 0.0
- Mean value is 1.4222454239942575
- There are two outlier values at 50 and 350.
- Mean and mode are closer and can be used for imputation. It will not introduce bias as the missing value percentage is small (0.33%)
Analysis of DEF_30_CNT_SOCIAL_CIRCLE column¶
- nullable values = 0.33%
df.DEF_30_CNT_SOCIAL_CIRCLE.value_counts().head()
0.00 271324 1.00 28328 2.00 5323 3.00 1192 4.00 253 Name: DEF_30_CNT_SOCIAL_CIRCLE, dtype: int64
sns.boxplot(df.DEF_30_CNT_SOCIAL_CIRCLE)
plt.show()
Getting percentile values for DEF_30_CNT_SOCIAL_CIRCLE
df.DEF_30_CNT_SOCIAL_CIRCLE.quantile(q = [0.25,0.5,0.75,0.99,1])
0.25 0.00 0.50 0.00 0.75 0.00 0.99 2.00 1.00 34.00 Name: DEF_30_CNT_SOCIAL_CIRCLE, dtype: float64
Most recurring value in DEF_30_CNT_SOCIAL_CIRCLE
df.DEF_30_CNT_SOCIAL_CIRCLE.mode()[0]
0.0
Checking the average value of DEF_30_CNT_SOCIAL_CIRCLE
df.DEF_30_CNT_SOCIAL_CIRCLE.mean()
0.1434206662533851
Observations
- Looking at the boxplot, median is 0.0
- Most recurring value is 0.0
- Mean value is 0.1434206662533851
- Even 99th percentile value is 2. There are ~7 outliers the largest of which is ~33.
- Mean and median are closer and can be used for imputation. It will not introduce bias as the missing value percentage is small (0.33%)
Analysis of OBS_60_CNT_SOCIAL_CIRCLE column¶
- nullable values = 0.33%
df.OBS_60_CNT_SOCIAL_CIRCLE.value_counts().head()
0.00 164666 1.00 48870 2.00 29766 3.00 20215 4.00 13946 Name: OBS_60_CNT_SOCIAL_CIRCLE, dtype: int64
sns.boxplot(df.OBS_60_CNT_SOCIAL_CIRCLE)
plt.show()
Getting percentile values for OBS_60_CNT_SOCIAL_CIRCLE
df.OBS_60_CNT_SOCIAL_CIRCLE.quantile(q = [0.25,0.5,0.75,0.99,1])
0.25 0.00 0.50 0.00 0.75 2.00 0.99 10.00 1.00 344.00 Name: OBS_60_CNT_SOCIAL_CIRCLE, dtype: float64
Most recurring value in OBS_60_CNT_SOCIAL_CIRCLE
df.OBS_60_CNT_SOCIAL_CIRCLE.mode()[0]
0.0
Checking the average value of OBS_60_CNT_SOCIAL_CIRCLE
df.OBS_60_CNT_SOCIAL_CIRCLE.mean()
1.4052921791901856
Observations
- Looking at the boxplot, median is 0.0
- Most recurring value is 0.0
- Mean value is 1.4052921791901856
- Even 99th percentile value is 10. There is a prominent outlier at 50 and 350 approximately.
- Mean and median are closer and can be used for imputation. It will not introduce bias as the missing value percentage is small (0.33%)
Analysis of DEF_60_CNT_SOCIAL_CIRCLE column¶
- nullable values = 0.33%
df.DEF_60_CNT_SOCIAL_CIRCLE.value_counts().head()
0.00 280721 1.00 21841 2.00 3170 3.00 598 4.00 135 Name: DEF_60_CNT_SOCIAL_CIRCLE, dtype: int64
sns.boxplot(df.DEF_60_CNT_SOCIAL_CIRCLE)
plt.show()
Getting percentile values for DEF_60_CNT_SOCIAL_CIRCLE
df.DEF_60_CNT_SOCIAL_CIRCLE.quantile(q = [0.25,0.5,0.75,0.99,1])
0.25 0.00 0.50 0.00 0.75 0.00 0.99 2.00 1.00 24.00 Name: DEF_60_CNT_SOCIAL_CIRCLE, dtype: float64
Most recurring value in DEF_60_CNT_SOCIAL_CIRCLE
df.DEF_60_CNT_SOCIAL_CIRCLE.mode()[0]
0.0
Checking the average value of DEF_60_CNT_SOCIAL_CIRCLE
df.DEF_60_CNT_SOCIAL_CIRCLE.mean()
0.10004894123788705
Observations
- Looking at the boxplot, median is 0.0
- Most recurring value is 0.0
- Mean value is 0.10004894123788705
- Even 99th percentile value is 2. There are ~7 outliers the largest of which is ~24.
- Mean and median are closer and can be used for imputation. It will not introduce bias as the missing value percentage is small (0.33%)
Analysis of EXT_SOURCE_2 column¶
- nullable values = 0.21%
df.EXT_SOURCE_2.value_counts().head()
0.29 721 0.26 417 0.27 343 0.16 322 0.27 306 Name: EXT_SOURCE_2, dtype: int64
sns.boxplot(df.EXT_SOURCE_2)
plt.show()
Getting percentile values for EXT_SOURCE_2
df.EXT_SOURCE_2.quantile(q = [0.25,0.5,0.75,0.99,1])
0.25 0.39 0.50 0.57 0.75 0.66 0.99 0.78 1.00 0.85 Name: EXT_SOURCE_2, dtype: float64
Most recurring value in EXT_SOURCE_2
df.EXT_SOURCE_2.mode()[0]
0.2858978721410488
Checking the average value of EXT_SOURCE_2
df.EXT_SOURCE_2.mean()
0.5143926741308463
Observations
- Looking at the boxplot, median is 0.565961
- Most recurring value is 0.2858978721410488
- Mean value is 0.5143926741308463
- There is no outlier in the dataset
- Mean and median are closer and can be used for imputation. It will not introduce bias as the missing value percentage is small (0.21%)
Analysis of AMT_GOODS_PRICE column¶
- nullable values = 0.09%
df.AMT_GOODS_PRICE.value_counts().head()
450000.00 26022 225000.00 25282 675000.00 24962 900000.00 15416 270000.00 11428 Name: AMT_GOODS_PRICE, dtype: int64
sns.boxplot(df.AMT_GOODS_PRICE)
plt.show()
Getting percentile values for AMT_GOODS_PRICE
df.AMT_GOODS_PRICE.quantile(q = [0.25,0.5,0.75,0.99,1])
0.25 238500.00 0.50 450000.00 0.75 679500.00 0.99 1800000.00 1.00 4050000.00 Name: AMT_GOODS_PRICE, dtype: float64
Most recurring value in AMT_GOODS_PRICE
df.AMT_GOODS_PRICE.mode()[0]
450000.0
Checking the average value of AMT_GOODS_PRICE
df.AMT_GOODS_PRICE.mean()
538396.2074288895
Observations
- Looking at the boxplot, median is 450000.0
- Most recurring value is 450000.0. So, median and mode are the same
- Mean value is 538396.2074288895
- Though there are values above 2500000 they cannot be treated as outliers as it could be a valid goods price
- Mean and median are exactly same and can be used for imputation. It will not introduce bias as the missing value percentage is small (0.09%)
Dealing with incorrect/unknown data values¶
Analysis of CODE_GENDER column¶
Checking range of values
df['CODE_GENDER'].value_counts()
F 202448 M 105059 XNA 4 Name: CODE_GENDER, dtype: int64
Gender should only be Male or Female. XNA value may indicate that the value was not provided by the loan applicant or missed by the loan officer verifying the application
df[df['CODE_GENDER'] == 'XNA']
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_2 | EXT_SOURCE_3 | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 35657 | 141289 | 0 | Revolving loans | XNA | Y | Y | 0 | 207000.00 | 382500.00 | 19125.00 | 337500.00 | Unaccompanied | Working | Secondary / secondary special | Married | Municipal apartment | 0.02 | -20232 | -10044 | -10024.00 | -3537 | 1 | 1 | 1 | 1 | 1 | 0 | Unknown | 2.00 | 3 | 3 | TUESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 1 | 0.30 | 0.46 | 0.00 | 0.00 | 0.00 | 0.00 | -286.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| 38566 | 144669 | 0 | Revolving loans | XNA | N | Y | 2 | 157500.00 | 270000.00 | 13500.00 | 225000.00 | Family | Working | Secondary / secondary special | Married | House / apartment | 0.03 | -13717 | -2797 | -2241.00 | -4659 | 1 | 1 | 1 | 1 | 1 | 0 | Low-skill Laborers | 4.00 | 2 | 2 | FRIDAY | 16 | 0 | 0 | 0 | 0 | 0 | 0 | Industry: type 3 | 0.71 | 0.31 | 0.00 | 0.00 | 0.00 | 0.00 | -493.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 3.00 | 0.00 | 4.00 |
| 83382 | 196708 | 0 | Revolving loans | XNA | N | Y | 1 | 135000.00 | 405000.00 | 20250.00 | 225000.00 | Unaccompanied | Working | Higher education | Married | House / apartment | 0.04 | -10647 | -1228 | -183.00 | -1671 | 1 | 1 | 1 | 1 | 1 | 0 | Core staff | 3.00 | 2 | 2 | THURSDAY | 15 | 0 | 0 | 0 | 0 | 0 | 0 | Kindergarten | 0.66 | 0.08 | 7.00 | 1.00 | 7.00 | 1.00 | -851.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 |
| 189640 | 319880 | 0 | Revolving loans | XNA | Y | Y | 0 | 247500.00 | 540000.00 | 27000.00 | 900000.00 | Unaccompanied | Commercial associate | Incomplete higher | Civil marriage | House / apartment | 0.04 | -9649 | -2293 | -4099.00 | -2326 | 1 | 1 | 1 | 1 | 1 | 0 | Unknown | 2.00 | 2 | 2 | FRIDAY | 15 | 0 | 0 | 0 | 0 | 0 | 0 | Medicine | 0.66 | 0.36 | 10.00 | 4.00 | 10.00 | 4.00 | -1681.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 6.00 |
As data looks valid, we will go ahead and check for an imputation method.
Femaleapplicants are twice the number ofMaleapplicants- And so, we will go ahead and impute
CODE_GENDERwith 'F'
df['CODE_GENDER'] = df['CODE_GENDER'].apply(lambda x: 'F' if x == 'XNA' else x)
Checking if XNA is removed
df['CODE_GENDER'].value_counts()
F 202452 M 105059 Name: CODE_GENDER, dtype: int64
Analysis of DAYS_BIRTH column¶
df['DAYS_BIRTH'].value_counts().head()
-13749 43 -13481 42 -10020 41 -18248 41 -15771 40 Name: DAYS_BIRTH, dtype: int64
There are ~17K+ unique records all of which seem to be having negative values
df['DAYS_BIRTH'].unique()
array([ -9461, -16765, -19046, ..., -7951, -7857, -25061])
df['DAYS_BIRTH'].nunique()
17460
Converting Days Birth to positive days
df['DAYS_BIRTH'] = df['DAYS_BIRTH'].apply(lambda x: -x if x < 0 else x)
df['DAYS_BIRTH'].value_counts()
13749 43
13481 42
18248 41
10020 41
15771 40
..
24880 1
7934 1
16436 1
25177 1
24517 1
Name: DAYS_BIRTH, Length: 17460, dtype: int64
All Days in DAYS_BIRTH have positive values
Creating a new column YEARS_BIRTH for ease of analysis¶
df['YEARS_BIRTH'] = df['DAYS_BIRTH'].apply(lambda x: round(x/365))
Analysis of NAME_FAMILY_STATUS column¶
Checking range of values
df['NAME_FAMILY_STATUS'].value_counts()
Married 196432 Single / not married 45444 Civil marriage 29775 Separated 19770 Widow 16088 Unknown 2 Name: NAME_FAMILY_STATUS, dtype: int64
Gender should only be Male or Female. Unknown value may indicate that the value was not provided by the loan applicant or missed by the loan officer verifying the application
df[df['NAME_FAMILY_STATUS'] == 'Unknown']
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_2 | EXT_SOURCE_3 | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | YEARS_BIRTH | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 41982 | 148605 | 0 | Revolving loans | M | N | Y | 0 | 450000.00 | 675000.00 | 33750.00 | NaN | NaN | Commercial associate | Lower secondary | Unknown | Municipal apartment | 0.02 | 12396 | -1161 | -3265.00 | -4489 | 1 | 1 | 1 | 1 | 1 | 0 | Managers | NaN | 2 | 2 | THURSDAY | 15 | 0 | 1 | 1 | 0 | 1 | 1 | Insurance | 0.70 | NaN | 3.00 | 0.00 | 3.00 | 0.00 | -876.00 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 34 |
| 187348 | 317181 | 0 | Revolving loans | F | N | Y | 0 | 202500.00 | 585000.00 | 29250.00 | NaN | NaN | Commercial associate | Higher education | Unknown | House / apartment | 0.03 | 12844 | -232 | -1597.00 | -1571 | 1 | 1 | 0 | 1 | 0 | 0 | Accountants | NaN | 2 | 2 | FRIDAY | 14 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 2 | 0.65 | 0.67 | 1.00 | 0.00 | 1.00 | 0.00 | -654.00 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 35 |
df['NAME_FAMILY_STATUS'].value_counts(normalize = True) * 100
Married 63.88 Single / not married 14.78 Civil marriage 9.68 Separated 6.43 Widow 5.23 Unknown 0.00 Name: NAME_FAMILY_STATUS, dtype: float64
As data looks valid, we will go ahead and check for an imputation method.
Marriedapplicants make up more than 63% of applicants.- Hence, we will go ahead and impute
NAME_FAMILY_STATUSwith 'Married'
df['NAME_FAMILY_STATUS'] = df['NAME_FAMILY_STATUS'].apply(lambda x: 'Married' if x == 'Unknown' else x)
Checking if Unknown is removed
df['NAME_FAMILY_STATUS'].value_counts()
Married 196434 Single / not married 45444 Civil marriage 29775 Separated 19770 Widow 16088 Name: NAME_FAMILY_STATUS, dtype: int64
Analysis of DAYS_EMPLOYED column¶
df['DAYS_EMPLOYED'].value_counts().head()
365243 55374 -200 156 -224 152 -230 151 -199 151 Name: DAYS_EMPLOYED, dtype: int64
df['DAYS_EMPLOYED'].value_counts(normalize = True) * 100
365243 18.01
-200 0.05
-224 0.05
-230 0.05
-199 0.05
...
-11060 0.00
-10409 0.00
-10155 0.00
-11948 0.00
-12341 0.00
Name: DAYS_EMPLOYED, Length: 12574, dtype: float64
len(df[df['DAYS_EMPLOYED'] < 365243])
252137
df[df['DAYS_EMPLOYED'] < 365243].DAYS_EMPLOYED.value_counts()
-200 156
-224 152
-199 151
-230 151
-212 150
...
-12824 1
-8609 1
-8673 1
-6803 1
-9374 1
Name: DAYS_EMPLOYED, Length: 12573, dtype: int64
df['DAYS_EMPLOYED'].unique()
array([ -637, -1188, -225, ..., -12971, -11084, -8694])
df['DAYS_EMPLOYED'].nunique()
12574
Observations
- There are ~55K+ records for which
DAYS_EMPLOYEDis 365243 days - Remaining 252K+ records have negative value for days
- There are 12,574 unique values for
DAYS_EMPLOYED
DAYS_EMPLOYEDcolumn indicates how many days before the application the person started current employment, the applicant/loan officer must have entered negative values to indicate the days before.- We will convert negative values in
DAYS_EMPLOYEDto positive days to standardize the days during use in calculations
df['DAYS_EMPLOYED'] = df['DAYS_EMPLOYED'].apply(lambda x: -x if x < 0 else x)
df['DAYS_EMPLOYED'].value_counts().head()
365243 55374 200 156 224 152 199 151 230 151 Name: DAYS_EMPLOYED, dtype: int64
We can see that all days in DAYS_EMPLOYED have positive values
For ~55K+ records for which DAYS_EMPLOYED is 365243 days
- Converting this to year gives us 1000 years which is physically impossible to be employed for an applicant
- This is present for 18% of the data and cannot be an anamoly
- They could either be
PensionersorUnemployedand looking at the data our conclusion is correct
There are two ways to handle this
- We let the data remain as it is and account for this during analysis OR
- We calculate Average Days Employed excluding this category and impute that instead of 365243 days for Pensioners.
For Unemployed, Days Employed can be 0
Note
- During calculations with this column, we need to account for this scenario as otherwise it will skew our results
df[df['DAYS_EMPLOYED'] == 365243].NAME_INCOME_TYPE.value_counts()
Pensioner 55352 Unemployed 22 Name: NAME_INCOME_TYPE, dtype: int64
Creating a new column YEARS_EMPLOYED for ease of analysis¶
df['YEARS_EMPLOYED'] = df['DAYS_EMPLOYED'].apply(lambda x: round(x/365))
Analysis of DAYS_REGISTRATION column¶
df['DAYS_REGISTRATION'].value_counts().head()
-1.00 113 -7.00 98 -6.00 96 -4.00 92 -2.00 92 Name: DAYS_REGISTRATION, dtype: int64
df['DAYS_REGISTRATION'].value_counts(normalize = True).head()
-1.00 0.00 -7.00 0.00 -6.00 0.00 -4.00 0.00 -2.00 0.00 Name: DAYS_REGISTRATION, dtype: float64
df['DAYS_REGISTRATION'].unique()
array([ -3648., -1186., -4260., ..., -16396., -14558., -14798.])
df['DAYS_REGISTRATION'].nunique()
15688
Converting DAYS_REGISTRATION to positive days
df['DAYS_REGISTRATION'] = df['DAYS_REGISTRATION'].apply(lambda x: -x if x < 0 else x)
df['DAYS_REGISTRATION'].value_counts().head()
1.00 113 7.00 98 6.00 96 2.00 92 4.00 92 Name: DAYS_REGISTRATION, dtype: int64
All Days in DAYS_REGISTRATION have positive values
Creating a new column YEARS_REGISTRATION for ease of analysis¶
df['YEARS_REGISTRATION'] = df['DAYS_REGISTRATION'].apply(lambda x: round(x/365))
Analysis of DAYS_ID_PUBLISH column¶
df['DAYS_ID_PUBLISH'].value_counts().head()
-4053 169 -4095 162 -4046 161 -4417 159 -4256 158 Name: DAYS_ID_PUBLISH, dtype: int64
df['DAYS_ID_PUBLISH'].value_counts(normalize = True).head()
-4053 0.00 -4095 0.00 -4046 0.00 -4417 0.00 -4256 0.00 Name: DAYS_ID_PUBLISH, dtype: float64
df['DAYS_ID_PUBLISH'].unique()
array([-2120, -291, -2531, ..., -6194, -5854, -6211])
df['DAYS_ID_PUBLISH'].nunique()
6168
Converting DAYS_ID_PUBLISH to positive days
df['DAYS_ID_PUBLISH'] = df['DAYS_ID_PUBLISH'].apply(lambda x: -x if x < 0 else x)
df['DAYS_ID_PUBLISH'].value_counts().head()
4053 169 4095 162 4046 161 4417 159 4256 158 Name: DAYS_ID_PUBLISH, dtype: int64
All Days in DAYS_ID_PUBLISH have positive values
Creating a new column YEARS_ID_PUBLISH for ease of analysis¶
df['YEARS_ID_PUBLISH'] = df['DAYS_ID_PUBLISH'].apply(lambda x: round(x/365))
Analysis of DAYS_LAST_PHONE_CHANGE column¶
df['DAYS_LAST_PHONE_CHANGE'].value_counts().head()
0.00 37672 -1.00 2812 -2.00 2318 -3.00 1763 -4.00 1285 Name: DAYS_LAST_PHONE_CHANGE, dtype: int64
df['DAYS_LAST_PHONE_CHANGE'].value_counts(normalize = True).head()
0.00 0.12 -1.00 0.01 -2.00 0.01 -3.00 0.01 -4.00 0.00 Name: DAYS_LAST_PHONE_CHANGE, dtype: float64
df['DAYS_LAST_PHONE_CHANGE'].unique()
array([-1134., -828., -815., ..., -3988., -3899., -3538.])
df['DAYS_LAST_PHONE_CHANGE'].nunique()
3773
Converting DAYS_LAST_PHONE_CHANGE to positive days
df['DAYS_LAST_PHONE_CHANGE'] = df['DAYS_LAST_PHONE_CHANGE'].apply(lambda x: -x if x < 0 else x)
df['DAYS_LAST_PHONE_CHANGE'].value_counts().head()
0.00 37672 1.00 2812 2.00 2318 3.00 1763 4.00 1285 Name: DAYS_LAST_PHONE_CHANGE, dtype: int64
All Days in DAYS_LAST_PHONE_CHANGE have positive values
Creating a new column YEARS_LAST_PHONE_CHANGE for ease of analysis¶
df['YEARS_LAST_PHONE_CHANGE'] = df['DAYS_LAST_PHONE_CHANGE'].apply(lambda x: round(x/365,0))
Automation functions¶
Outlier Analysis (Distplot + BoxPlot) - numerical columns¶
def fn_dist_box(dataset,column):
plt.subplots(1,2 ,figsize = (20,8))
plt.subplot(121)
sns.distplot(dataset[column], color = 'purple')
pltname = 'Distplot of ' + column
plt.ticklabel_format(style='plain', axis='x')
plt.title(pltname)
plt.subplot(122)
# sns.boxplot(y = column, data = dataset, fliersize = 10)
red_diamond = dict(markerfacecolor='r', marker='D')
sns.boxplot(y = column, data = dataset, flierprops = red_diamond)
# plt.boxplot(dataset[column], flierprops = red_diamond)
pltname = 'Boxplot of ' + column
plt.title(pltname)
plt.tight_layout(pad = 4)
plt.show()
Creating a function age_cat to categorize YEARS_BIRTH¶
def age_cat(years):
if years <= 20:
return '0-20'
elif years > 20 and years <= 30:
return '20-30'
elif years > 30 and years <= 40:
return '30-40'
elif years > 40 and years <= 50:
return '40-50'
elif years > 50 and years <= 60:
return '50-60'
elif years > 60 and years <= 70:
return '60-70'
elif years > 70:
return '70+'
Univariate Analysis (Countplot) - categorical columns¶
def fn_uni_countplot(column):
plt.figure(figsize = [20,8])
palt = sns.color_palette("bright")
plt.subplot(1,2,1)
pltname = column + ' of clients with payment difficulties'
plt.title(pltname)
sns.countplot(x = column, data = df1, order = sorted(df1[column].unique(), reverse = True), palette = palt)
plt.xticks(rotation = 90)
plt.subplot(1,2,2)
pltname = column + ' of clients with on-time payments'
plt.title(pltname)
sns.countplot(x = column, data = df0, order = sorted(df0[column].unique(), reverse = True), palette = palt)
plt.xticks(rotation = 90)
plt.tight_layout(pad = 4)
plt.show()
Univariate Analysis (Piechart) - categorical columns¶
def fn_uni_piechart(column):
plt.figure(figsize = [20,12])
plt.subplot(1,2,1)
pltname = column + ' of clients with payment difficulties'
plt.title(pltname)
df1[column].value_counts().plot.pie(autopct='%1.1f%%',shadow=True, startangle=60, colors = ['green','yellow','purple','orange','red'], labeldistance=None)
plt.legend()
plt.subplot(1,2,2)
pltname = column + ' of clients with on-time payments'
plt.title(pltname)
df0[column].value_counts().plot.pie(autopct='%1.1f%%',shadow=True, startangle=60, colors = ['green','yellow','purple','orange','red'], labeldistance=None)
plt.legend()
plt.tight_layout(pad = 4)
plt.show()
Univariate Analysis (Barplot) - categorical columns¶
def fn_uni_barplot(column):
plt.figure(figsize = [20,8])
plt.subplot(1,2,1)
(df1[column].value_counts(normalize=True)*100).plot.bar(title = column + " Payment difficulties", color=['black', 'red', 'green', 'blue', 'cyan'])
plt.xticks(rotation=90)
plt.subplot(1,2,2)
(df0[column].value_counts(normalize=True)*100).plot.bar(title = column + " On-Time Payments", color=['black', 'red', 'green', 'blue', 'cyan'])
plt.xticks(rotation=90)
plt.tight_layout(pad = 4)
plt.show()
Calculating min and max outlier range for numerical columns¶
def outlier_range(dataset,column):
Q1 = dataset[column].quantile(0.25)
Q3 = dataset[column].quantile(0.75)
IQR = Q3 - Q1
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
return Max_value
Bivariate Analysis (boxplot) - categorical V/S continuous variables¶
def fn_bi_boxplot(categorical,continuous,max_continuous1,max_continuous0,hue_column):
plt.figure(figsize = [20,12])
plt.subplot(1,2,1)
plt.title('Payment Difficulties')
red_diamond = dict(markerfacecolor='r', marker='D')
sns.boxplot(x = categorical,
y = df1[df1[continuous] < max_continuous1][continuous],
data = df1,
flierprops = red_diamond,
order = sorted(df1[categorical].unique(), reverse = True),
hue = hue_column, hue_order = sorted(df0[hue_column].unique(), reverse = True))
plt.ticklabel_format(style='plain', axis='y')
plt.xticks(rotation=90)
plt.subplot(1,2,2)
plt.title('On-Time Payments')
sns.boxplot(x = categorical,
y = df0[df0[continuous] < max_continuous0][continuous],
data = df0,
flierprops = red_diamond,
order = sorted(df0[categorical].unique(), reverse = True),
hue = hue_column, hue_order = sorted(df0[hue_column].unique(), reverse = True))
plt.ticklabel_format(style='plain', axis='y')
plt.xticks(rotation=90)
plt.tight_layout(pad = 4)
plt.show()
Bivariate Analysis (Countplot) - categorical V/S categorical columns¶
def fn_bi_countplot(column,hue_column):
plt.figure(figsize = [20,8])
palt = sns.color_palette("bright")
plt.subplot(1,2,1)
pltname = 'Clients with payment difficulties'
plt.title(pltname)
sns.countplot(x = column, data = df1,
order = sorted(df1[column].unique(), reverse = True), palette = palt,
hue = hue_column, hue_order = sorted(df1[hue_column].unique(), reverse = True))
plt.xticks(rotation = 90)
plt.subplot(1,2,2)
pltname = 'Clients with on-time payments'
plt.title(pltname)
sns.countplot(x = column, data = df0,
order = sorted(df0[column].unique(), reverse = True), palette = palt,
hue = hue_column, hue_order = sorted(df0[hue_column].unique(), reverse = True))
plt.xticks(rotation = 90)
plt.tight_layout(pad = 4)
plt.show()
Univariate Analysis (Countplot) - categorical columns - Merged dataset¶
def fn_uni_countplot_merge(column):
plt.figure(figsize = [20,8])
palt = sns.color_palette("bright")
pltname = 'Analysis of ' + column
plt.title(pltname)
sns.countplot(x = column, data = df_merge, order = sorted(df_merge[column].unique(), reverse = True), palette = palt)
plt.xticks(rotation = 90)
plt.tight_layout(pad = 4)
plt.show()
Univariate Analysis (Piechart) - categorical columns - Merged dataset¶
def fn_uni_piechart_merge(column):
plt.figure(figsize = [10,6])
pltname = 'Analysis of ' + column
plt.title(pltname)
df_merge[column].value_counts().plot.pie(autopct='%1.1f%%',shadow=True, startangle=60, colors = ['purple','orange','red','green','yellow','pink'], labeldistance=None)
plt.legend()
plt.tight_layout(pad = 4)
plt.show()
Bivariate Analysis (boxplot) - categorical V/S continuous variables - Merged dataset¶
def fn_bi_boxplot_merge(categorical,continuous,max_continuous,hue_column):
plt.figure(figsize = [20,12])
red_diamond = dict(markerfacecolor='r', marker='D')
sns.boxplot(x = categorical,
y = df_merge[df_merge[continuous] < max_continuous][continuous],
data = df_merge,
flierprops = red_diamond,
order = sorted(df_merge[categorical].unique(), reverse = True),
hue = hue_column, hue_order = sorted(df_merge[hue_column].unique(), reverse = True))
plt.ticklabel_format(style='plain', axis='y')
plt.xticks(rotation=90)
plt.tight_layout(pad = 4)
plt.show()
Bivariate Analysis (Countplot) - categorical V/S categorical columns - Merged dataset¶
def fn_bi_countplot_merge(column,hue_column):
plt.figure(figsize = [20,8])
palt = sns.color_palette("bright")
sns.countplot(x = column, data = df_merge,
order = sorted(df_merge[column].unique()), palette = palt,
hue = hue_column, hue_order = sorted(df_merge[hue_column].unique(), reverse = True))
plt.xticks(rotation = 90)
plt.tight_layout(pad = 4)
plt.show()
Dealing with outliers for numerical columns¶
Analysis of CNT_CHILDREN column¶
df['CNT_CHILDREN'].value_counts().sort_values(ascending = False).head()
0 215371 1 61119 2 26749 3 3717 4 429 Name: CNT_CHILDREN, dtype: int64
(df['CNT_CHILDREN'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()
0 70.04 1 19.88 2 8.70 3 1.21 4 0.14 Name: CNT_CHILDREN, dtype: float64
fn_dist_box(df,'CNT_CHILDREN')
Calculating IQR (Inter Quartile range)
Q1 = df['CNT_CHILDREN'].quantile(0.25)
Q3 = df['CNT_CHILDREN'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
1.0
Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))
Min value before which outlier exist: -1.5 Max value after which outlier exist: 2.5
Observations
- Looking at the data, we can see that above 7 children, the count of applicants are very minimal (2 or 3 in each category)
- Also, looking at the observation data for applicants with 10 children, the applicants are only 31 and 41 years respectively. This seems a one-off scenario and can be treated as an outlier
- Both distplots and boxplots clearly show the values above 2.5 as being outliers
Conclusion
- Applicants with 3 or more children are outlier cases
Analysis of AMT_INCOME_TOTAL column¶
df['AMT_INCOME_TOTAL'].value_counts().sort_values(ascending = False).head()
135000.00 35750 112500.00 31019 157500.00 26556 180000.00 24719 90000.00 22483 Name: AMT_INCOME_TOTAL, dtype: int64
(df['AMT_INCOME_TOTAL'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()
135000.00 11.63 112500.00 10.09 157500.00 8.64 180000.00 8.04 90000.00 7.31 Name: AMT_INCOME_TOTAL, dtype: float64
df['AMT_INCOME_TOTAL'].describe(percentiles = [0.75,0.99,0.999])
count 307511.00 mean 168797.92 std 237123.15 min 25650.00 50% 147150.00 75% 202500.00 99% 472500.00 99.9% 900000.00 max 117000000.00 Name: AMT_INCOME_TOTAL, dtype: float64
Plotting for AMT_INCOME_TOTAL
fn_dist_box(df,'AMT_INCOME_TOTAL')
- The end result charts are very thin and we are able to observe an outlier around ~120 million.
- Let's plot it out by only considering income under 99.9% value which is 900K
plt.subplots(1,2 ,figsize = (20,8))
plt.subplot(121)
sns.distplot(df[df['AMT_INCOME_TOTAL'] < 900000].AMT_INCOME_TOTAL)
pltname = 'Distplot of ' + 'AMT_INCOME_TOTAL'
plt.title(pltname)
plt.subplot(122)
sns.boxplot(df[df['AMT_INCOME_TOTAL'] < 900000].AMT_INCOME_TOTAL)
pltname = 'Boxplot of ' + 'AMT_INCOME_TOTAL'
plt.title(pltname)
plt.tight_layout(pad = 4)
plt.show()
Now, we are able to clearly make out the distribution and data range in both plots.
- This means that values above 900K income are clearly outliers
df[df['AMT_INCOME_TOTAL'] > 900000].head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_2 | EXT_SOURCE_3 | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | YEARS_BIRTH | YEARS_EMPLOYED | YEARS_REGISTRATION | YEARS_ID_PUBLISH | YEARS_LAST_PHONE_CHANGE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1504 | 101769 | 0 | Revolving loans | M | Y | Y | 0 | 1080000.00 | 180000.00 | 9000.00 | 180000.00 | Unaccompanied | Commercial associate | Higher education | Married | House / apartment | 0.07 | 13633 | 5536 | 3.00 | 4971 | 1 | 1 | 0 | 1 | 0 | 0 | Unknown | 2.00 | 1 | 1 | FRIDAY | 13 | 0 | 1 | 1 | 0 | 0 | 0 | Business Entity Type 3 | 0.65 | 0.46 | 0.00 | 0.00 | 0.00 | 0.00 | 491.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 37 | 15 | 0 | 14 | 1.00 |
| 1723 | 102015 | 0 | Cash loans | F | N | Y | 0 | 1935000.00 | 269550.00 | 10534.50 | 225000.00 | Unaccompanied | Pensioner | Secondary / secondary special | Married | House / apartment | 0.01 | 18636 | 365243 | 3401.00 | 2168 | 1 | 0 | 0 | 1 | 1 | 0 | Unknown | 2.00 | 2 | 2 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | XNA | 0.27 | 0.33 | 1.00 | 0.00 | 1.00 | 0.00 | 381.00 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 51 | 1001 | 9 | 6 | 1.00 |
| 3371 | 103938 | 0 | Cash loans | F | N | N | 1 | 1350000.00 | 2410380.00 | 109053.00 | 2250000.00 | NaN | Commercial associate | Higher education | Married | House / apartment | 0.07 | 12213 | 843 | 757.00 | 4132 | 1 | 1 | 0 | 1 | 1 | 0 | Managers | 3.00 | 1 | 1 | SUNDAY | 13 | 0 | 1 | 1 | 0 | 0 | 0 | Housing | 0.72 | 0.57 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 3.00 | 0.00 | 0.00 | 33 | 2 | 2 | 11 | 0.00 |
| 4603 | 105384 | 0 | Revolving loans | F | Y | Y | 0 | 1350000.00 | 405000.00 | 20250.00 | 405000.00 | Unaccompanied | Commercial associate | Higher education | Married | House / apartment | 0.02 | 14781 | 4768 | 3469.00 | 4045 | 1 | 1 | 0 | 1 | 0 | 0 | Managers | 2.00 | 2 | 2 | TUESDAY | 12 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.66 | 0.73 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 40 | 13 | 10 | 11 | 0.00 |
| 5673 | 106637 | 0 | Cash loans | M | Y | Y | 3 | 967500.00 | 450000.00 | 30073.50 | 450000.00 | Unaccompanied | Commercial associate | Higher education | Married | House / apartment | 0.05 | 11901 | 546 | 2932.00 | 215 | 1 | 1 | 0 | 1 | 0 | 0 | Managers | 5.00 | 1 | 1 | MONDAY | 14 | 0 | 0 | 0 | 1 | 1 | 0 | Business Entity Type 3 | 0.49 | 0.45 | 0.00 | 0.00 | 0.00 | 0.00 | 88.00 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 0.00 | 33 | 1 | 8 | 1 | 0.00 |
Observations
- Looking at the data, we can see that Income above 900K (99.9% value) are outliers
- Both distplots and boxplots clearly show the same trend
Conclusion
- Applicants with Income above 900K (99.9% value) are outliers
Analysis of CNT_FAM_MEMBERS column¶
df['CNT_FAM_MEMBERS'].value_counts().sort_values(ascending = False).head()
2.00 158357 1.00 67847 3.00 52601 4.00 24697 5.00 3478 Name: CNT_FAM_MEMBERS, dtype: int64
(df['CNT_FAM_MEMBERS'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()
2.00 51.50 1.00 22.06 3.00 17.11 4.00 8.03 5.00 1.13 Name: CNT_FAM_MEMBERS, dtype: float64
df['CNT_FAM_MEMBERS'].describe(percentiles = [0.75,0.99,0.9999])
count 307509.00 mean 2.15 std 0.91 min 1.00 50% 2.00 75% 3.00 99% 5.00 99.99% 8.00 max 20.00 Name: CNT_FAM_MEMBERS, dtype: float64
Plotting for CNT_FAM_MEMBERS
fn_dist_box(df,'CNT_FAM_MEMBERS')
- For family members count from 5 upwards, the results are sparse and there is a clear outlier around 20
- Applicants with family members above 10 are only 2 or 3
- Let's plot it out by only considering family members under 99.9% value which is 8
plt.subplots(1,2 ,figsize = (20,8))
plt.subplot(121)
sns.distplot(df[df['CNT_FAM_MEMBERS'] <= 8].CNT_FAM_MEMBERS)
pltname = 'Distplot of ' + 'CNT_FAM_MEMBERS'
plt.title(pltname)
plt.subplot(122)
sns.boxplot(df[df['CNT_FAM_MEMBERS'] <= 8].CNT_FAM_MEMBERS)
pltname = 'Boxplot of ' + 'CNT_FAM_MEMBERS'
plt.title(pltname)
plt.tight_layout(pad = 4)
plt.show()
Now, we are able to clearly make out the distribution and data range in both plots.
- This means that applicants with 5 or more family members are clearly outliers
Analysis of AMT_ANNUITY column¶
df['AMT_ANNUITY'].value_counts().sort_values(ascending = False).head()
9000.00 6385 13500.00 5514 6750.00 2279 10125.00 2035 37800.00 1602 Name: AMT_ANNUITY, dtype: int64
(df['AMT_ANNUITY'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()
9000.00 2.08 13500.00 1.79 6750.00 0.74 10125.00 0.66 37800.00 0.52 Name: AMT_ANNUITY, dtype: float64
df['AMT_ANNUITY'].describe(percentiles = [0.75,0.99,0.9999])
count 307499.00 mean 27108.57 std 14493.74 min 1615.50 50% 24903.00 75% 34596.00 99% 70006.50 99.99% 213291.00 max 258025.50 Name: AMT_ANNUITY, dtype: float64
Calculating IQR (Inter Quartile range)
Q1 = df['AMT_ANNUITY'].quantile(0.25)
Q3 = df['AMT_ANNUITY'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
18072.0
Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))
Min value before which outlier exist: -10584.0 Max value after which outlier exist: 61704.0
Plotting for AMT_ANNUITY
fn_dist_box(df,'AMT_ANNUITY')
- As observed from displot and boxplot, the outliers tend to exist after 61704.0 (This value is derived using
Max_valueusing IQR formulae) - We can verify the same by reproducing the same plots under outlier value as shown below
plt.subplots(1,2 ,figsize = (20,8))
plt.subplot(121)
sns.distplot(df[df['AMT_ANNUITY'] <= 61704.0].AMT_ANNUITY)
pltname = 'Distplot of ' + 'AMT_ANNUITY'
plt.title(pltname)
plt.subplot(122)
sns.boxplot(df[df['AMT_ANNUITY'] <= 61704.0].AMT_ANNUITY)
pltname = 'Boxplot of ' + 'AMT_ANNUITY'
plt.title(pltname)
plt.tight_layout(pad = 4)
plt.show()
Now, we are able to clearly make out the distribution and data range in both plots
- This means that values above 61704.0
AMT_ANNUITYare clearly outliers
Observations
- As observed from displot and boxplot, the outliers tend to exist after 61704 (Outlier value is derived using
Max_valueusing IQR formulae)
Conclusion
- Applicants with
AMT_ANNUITYabove 61704 (calculated using IQR) are outliers
Analysis of AMT_CREDIT column¶
df['AMT_CREDIT'].value_counts().sort_values(ascending = False).head()
450000.00 9709 675000.00 8877 225000.00 8162 180000.00 7342 270000.00 7241 Name: AMT_CREDIT, dtype: int64
(df['AMT_CREDIT'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()
450000.00 3.16 675000.00 2.89 225000.00 2.65 180000.00 2.39 270000.00 2.35 Name: AMT_CREDIT, dtype: float64
df['AMT_CREDIT'].describe(percentiles = [0.75,0.99,0.9999])
count 307511.00 mean 599026.00 std 402490.78 min 45000.00 50% 513531.00 75% 808650.00 99% 1854000.00 99.99% 3030530.76 max 4050000.00 Name: AMT_CREDIT, dtype: float64
Calculating IQR (Inter Quartile range)
Q1 = df['AMT_CREDIT'].quantile(0.25)
Q3 = df['AMT_CREDIT'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
538650.0
Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))
Min value before which outlier exist: -537975.0 Max value after which outlier exist: 1616625.0
Plotting for AMT_CREDIT
fn_dist_box(df,'AMT_CREDIT')
- As observed from displot and boxplot, the outliers tend to exist after 1616625.0 (This value is derived using
Max_valueusing IQR formulae) - We can verify the same by reproducing the same plots under outlier value as shown below
plt.subplots(1,2 ,figsize = (20,8))
plt.subplot(121)
sns.distplot(df[df['AMT_CREDIT'] <= 1616625.0].AMT_CREDIT)
pltname = 'Distplot of ' + 'AMT_CREDIT'
plt.title(pltname)
plt.subplot(122)
sns.boxplot(df[df['AMT_CREDIT'] <= 1616625.0].AMT_CREDIT)
pltname = 'Boxplot of ' + 'AMT_CREDIT'
plt.title(pltname)
plt.tight_layout(pad = 4)
plt.show()
Now, we are able to clearly make out the distribution and data range in both plots
- This means that values above 61616625.0
AMT_CREDITare clearly outliers
Observations
- As observed from displot and boxplot, the outliers tend to exist after 1616625.0 (Outlier value is derived using
Max_valueusing IQR formulae)
Conclusion
- Applicants with
AMT_CREDITabove 1616625.0 (calculated using IQR) are outliers
Analysis of DAYS_LAST_PHONE_CHANGE column¶
df['DAYS_LAST_PHONE_CHANGE'].value_counts().sort_values(ascending = False).head()
0.00 37672 1.00 2812 2.00 2318 3.00 1763 4.00 1285 Name: DAYS_LAST_PHONE_CHANGE, dtype: int64
(df['DAYS_LAST_PHONE_CHANGE'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()
0.00 12.25 1.00 0.91 2.00 0.75 3.00 0.57 4.00 0.42 Name: DAYS_LAST_PHONE_CHANGE, dtype: float64
df['DAYS_LAST_PHONE_CHANGE'].describe(percentiles = [0.75,0.99,0.9999])
count 307510.00 mean 962.86 std 826.81 min 0.00 50% 757.00 75% 1570.00 99% 3149.00 99.99% 3999.75 max 4292.00 Name: DAYS_LAST_PHONE_CHANGE, dtype: float64
Calculating IQR (Inter Quartile range)
Q1 = df['DAYS_LAST_PHONE_CHANGE'].quantile(0.25)
Q3 = df['DAYS_LAST_PHONE_CHANGE'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
1296.0
Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))
Min value before which outlier exist: -1670.0 Max value after which outlier exist: 3514.0
Plotting for DAYS_LAST_PHONE_CHANGE
fn_dist_box(df,'DAYS_LAST_PHONE_CHANGE')
- As observed from displot and boxplot, the outliers tend to exist after 3514.0 (This value is derived using
Max_valueusing IQR formulae) - We can verify the same by reproducing the same plots under outlier value as shown below
plt.subplots(1,2 ,figsize = (20,8))
plt.subplot(121)
sns.distplot(df[df['DAYS_LAST_PHONE_CHANGE'] <= 3514.0].DAYS_LAST_PHONE_CHANGE)
pltname = 'Distplot of ' + 'DAYS_LAST_PHONE_CHANGE'
plt.title(pltname)
plt.subplot(122)
sns.boxplot(df[df['DAYS_LAST_PHONE_CHANGE'] <= 3514.0].DAYS_LAST_PHONE_CHANGE)
pltname = 'Boxplot of ' + 'DAYS_LAST_PHONE_CHANGE'
plt.title(pltname)
plt.tight_layout(pad = 4)
plt.show()
Now, we are able to clearly make out the distribution and data range in both plots
- This means that values above 3514.0
DAYS_LAST_PHONE_CHANGEare clearly outliers
Observations
- As observed from displot and boxplot, the outliers tend to exist after 3514.0 (Outlier value is derived using
Max_valueusing IQR formulae)
Conclusion
- Applicants with
DAYS_LAST_PHONE_CHANGEabove 3514.0 (calculated using IQR) are outliers
Binning of continuous columns for analysis¶
Categorizing AMT_GOODS_PRICE column¶
df['AMT_GOODS_PRICE'].value_counts().sort_values(ascending = False).head()
450000.00 26022 225000.00 25282 675000.00 24962 900000.00 15416 270000.00 11428 Name: AMT_GOODS_PRICE, dtype: int64
(df['AMT_GOODS_PRICE'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()
450000.00 8.47 225000.00 8.23 675000.00 8.12 900000.00 5.02 270000.00 3.72 Name: AMT_GOODS_PRICE, dtype: float64
Getting Statistical summary for AMT_GOODS_PRICE
df['AMT_GOODS_PRICE'].describe(percentiles = [0.25,0.75,0.99,0.9999])
count 307233.00 mean 538396.21 std 369446.46 min 40500.00 25% 238500.00 50% 450000.00 75% 679500.00 99% 1800000.00 99.99% 2934964.80 max 4050000.00 Name: AMT_GOODS_PRICE, dtype: float64
Categorize the values in AMT_GOODS_PRICE into 5 bins and create a new column AMT_GOODS_PRICE_CATEGORY
df['AMT_GOODS_PRICE_CATEGORY'] = pd.cut(df['AMT_GOODS_PRICE'], bins = 5, labels = ['very low','low','medium','high','very high'])
Checking that the values are populated as per expectation
df['AMT_GOODS_PRICE_CATEGORY'].value_counts()
very low 246973 low 56087 medium 4108 high 47 very high 18 Name: AMT_GOODS_PRICE_CATEGORY, dtype: int64
Categorizing YEARS_BIRTH column¶
We will be categorizing 'YEARS_BIRTH' instead of 'DAYS_BIRTH' as years are easier to interpret than days
df['YEARS_BIRTH'].value_counts().sort_values(ascending = False).head()
39 9023 38 8793 37 8727 40 8602 41 8495 Name: YEARS_BIRTH, dtype: int64
(df['YEARS_BIRTH'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()
39 2.93 38 2.86 37 2.84 40 2.80 41 2.76 Name: YEARS_BIRTH, dtype: float64
Getting Statistical summary for YEARS_BIRTH
df['YEARS_BIRTH'].describe(percentiles = [0.25,0.75,0.99,0.9999])
count 307511.00 mean 43.94 std 11.96 min 21.00 25% 34.00 50% 43.00 75% 54.00 99% 67.00 99.99% 69.00 max 69.00 Name: YEARS_BIRTH, dtype: float64
Categorize the values from YEARS_BIRTH into a new column YEARS_BIRTH_CATEGORY
df['YEARS_BIRTH_CATEGORY'] = df['YEARS_BIRTH'].apply(age_cat)
Checking that the values are populated as per expectation
df['YEARS_BIRTH_CATEGORY'].value_counts().sort_values(ascending = False)
30-40 82770 40-50 75509 50-60 67955 20-30 48869 60-70 32408 Name: YEARS_BIRTH_CATEGORY, dtype: int64
Categorizing YEARS_REGISTRATION column¶
We will be categorizing 'YEARS_REGISTRATION' instead of 'DAYS_REGISTRATION' as years are easier to interpret than days
df['YEARS_REGISTRATION'].value_counts().sort_values(ascending = False).head()
1 15661 2 15564 3 13894 13 12866 12 12697 Name: YEARS_REGISTRATION, dtype: int64
(df['YEARS_REGISTRATION'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()
1 5.09 2 5.06 3 4.52 13 4.18 12 4.13 Name: YEARS_REGISTRATION, dtype: float64
Getting Statistical summary for YEARS_REGISTRATION
df['YEARS_REGISTRATION'].describe(percentiles = [0.25,0.75,0.99,0.9999])
count 307511.00 mean 13.66 std 9.66 min 0.00 25% 6.00 50% 12.00 75% 20.00 99% 38.00 99.99% 54.25 max 68.00 Name: YEARS_REGISTRATION, dtype: float64
Categorize the values from YEARS_REGISTRATION into a new column YEARS_REGISTRATION_CATEGORY
df['YEARS_REGISTRATION_CATEGORY'] = df['YEARS_REGISTRATION'].apply(age_cat)
Checking that the values are populated as per expectation
df['YEARS_REGISTRATION_CATEGORY'].value_counts().sort_values(ascending = False)
0-20 230697 20-30 59288 30-40 16051 40-50 1406 50-60 60 60-70 9 Name: YEARS_REGISTRATION_CATEGORY, dtype: int64
Checking Imbalance for target column `TARGET'¶
Analyzing TARGET column¶
df['TARGET'].value_counts().sort_values(ascending = False)
0 282686 1 24825 Name: TARGET, dtype: int64
df['TARGET'].value_counts(normalize = True).sort_values(ascending = False) * 100
0 91.93 1 8.07 Name: TARGET, dtype: float64
plt.figure(figsize = (10,5))
sns.countplot(x = df['TARGET'], data = df)
plt.title('Checking imbalance ratio of TARGET variable')
plt.show()
plt.figure(figsize = [20,8])
(df['TARGET'].value_counts(normalize=True)*100).plot.bar(title = "Checking imbalance percentage for TARGET variable", color=['black', 'red', 'green', 'blue', 'cyan'])
plt.xticks(rotation=0)
plt.show()
Observations
- We have imbalance in
TARGETvariable based on the % of observations TARGETvalue 1 represents client with payment difficulties (he/she had late payment more than X days on at least one of the first Y installments of the loan). This is only 8.07% of the dataTARGETvalue 0 represents all other cases than 1. This is 91.93% of the data
Data split based on TARGET¶
Create new dataframe with TARGET value 1¶
TARGETvalue 1 represents client with payment difficulties (he/she had late payment more than X days on at least one of the first Y installments of the loan). This is only 8.07% of the data
df1 = df[df['TARGET'] == 1]
df1.TARGET.value_counts()
1 24825 Name: TARGET, dtype: int64
Create new dataframe with TARGET value 0¶
TARGETvalue 0 represents all other cases than 1. This is 91.93% of the data
df0 = df[df['TARGET'] == 0]
df0.TARGET.value_counts()
0 282686 Name: TARGET, dtype: int64
Univariate analysis of categorical variables¶
Analysis of NAME_CONTRACT_TYPE¶
NAME_CONTRACT_TYPE with payment difficulties
df1['NAME_CONTRACT_TYPE'].value_counts().sort_values(ascending = False)
Cash loans 23221 Revolving loans 1604 Name: NAME_CONTRACT_TYPE, dtype: int64
NAME_CONTRACT_TYPE with on-time payments
df0['NAME_CONTRACT_TYPE'].value_counts().sort_values(ascending = False)
Cash loans 255011 Revolving loans 27675 Name: NAME_CONTRACT_TYPE, dtype: int64
fn_uni_countplot('NAME_CONTRACT_TYPE')
fn_uni_piechart('NAME_CONTRACT_TYPE')
Observations
- Looking at CountPlot, we don't see significant differences in
NAME_CONTRACT_TYPEb/w clients with payment difficulties and on-time payments - Looking at PieChart, we don't see significant differences in
NAME_CONTRACT_TYPEb/w clients with payment difficulties and on-time payments
Conclusion
NAME_CONTRACT_TYPEcolumn does not provide any conclusive evidence in favor of clients with payment difficulties OR on-time payments
Analysis of CODE_GENDER¶
CODE_GENDER with payment difficulties
df1['CODE_GENDER'].value_counts().sort_values(ascending = False)
F 14170 M 10655 Name: CODE_GENDER, dtype: int64
CODE_GENDER with on-time payments
df0['CODE_GENDER'].value_counts().sort_values(ascending = False)
F 188282 M 94404 Name: CODE_GENDER, dtype: int64
sorted(df['CODE_GENDER'].unique(), reverse = True)
['M', 'F']
fn_uni_countplot('CODE_GENDER')
fn_uni_piechart('CODE_GENDER')
Observations
- Looking at CountPlot and PieChart, there is a 9.5% decrease in "Male" values from
CODE_GENDERb/w clients with payment difficulties to on-time payments. It's a weak correlation that Males have more payment difficulties.
Conclusion
CODE_GENDERcolumn provides a weak inference that "Male" clients have more payment difficulties
Analysis of FLAG_OWN_CAR¶
FLAG_OWN_CAR with payment difficulties
df1['FLAG_OWN_CAR'].value_counts().sort_values(ascending = False)
N 17249 Y 7576 Name: FLAG_OWN_CAR, dtype: int64
FLAG_OWN_CAR with on-time payments
df0['FLAG_OWN_CAR'].value_counts().sort_values(ascending = False)
N 185675 Y 97011 Name: FLAG_OWN_CAR, dtype: int64
fn_uni_countplot('FLAG_OWN_CAR')
fn_uni_piechart('FLAG_OWN_CAR')
Observations
- Looking at CountPlot, we don't see significant differences in
FLAG_OWN_CARb/w clients with payment difficulties and on-time payments - Looking at PieChart, we don't see significant differences in
FLAG_OWN_CARb/w clients with payment difficulties and on-time payments.
Conclusion
FLAG_OWN_CARcolumn does not provide any conclusive evidence in favor of clients with payment difficulties OR on-time payments
Analysis of NAME_INCOME_TYPE¶
NAME_INCOME_TYPE with payment difficulties
df1['NAME_INCOME_TYPE'].value_counts().sort_values(ascending = False)
Working 15224 Commercial associate 5360 Pensioner 2982 State servant 1249 Unemployed 8 Maternity leave 2 Name: NAME_INCOME_TYPE, dtype: int64
NAME_INCOME_TYPE with on-time payments
df0['NAME_INCOME_TYPE'].value_counts().sort_values(ascending = False)
Working 143550 Commercial associate 66257 Pensioner 52380 State servant 20454 Student 18 Unemployed 14 Businessman 10 Maternity leave 3 Name: NAME_INCOME_TYPE, dtype: int64
fn_uni_countplot('NAME_INCOME_TYPE')
fn_uni_barplot('NAME_INCOME_TYPE')
Observations
- Pensioners have better on-time payments
- Students don't have Payment difficulties
- Businessmen don't have Payment difficulties
Conclusion
- Pensioners have better on-time payments. This is a weak correlation.
- Students don't have Payment difficulties. In this case, total students have only 18 observations and should be treated as a weak correlation
- Businessmen don't have Payment difficulties. In this case, Businessmen have only 10 observations and should be treated as a weak correlation
Analysis of NAME_EDUCATION_TYPE¶
NAME_EDUCATION_TYPE with payment difficulties
df1['NAME_EDUCATION_TYPE'].value_counts().sort_values(ascending = False)
Secondary / secondary special 19524 Higher education 4009 Incomplete higher 872 Lower secondary 417 Academic degree 3 Name: NAME_EDUCATION_TYPE, dtype: int64
NAME_EDUCATION_TYPE with on-time payments
df0['NAME_EDUCATION_TYPE'].value_counts().sort_values(ascending = False)
Secondary / secondary special 198867 Higher education 70854 Incomplete higher 9405 Lower secondary 3399 Academic degree 161 Name: NAME_EDUCATION_TYPE, dtype: int64
fn_uni_countplot('NAME_EDUCATION_TYPE')
fn_uni_barplot('NAME_EDUCATION_TYPE')
fn_uni_piechart('NAME_EDUCATION_TYPE')
Observations
- Clients with 'Higher education' have better on-timepayments than payment difficulties
- Remaining categories don't provide any conclusive results
Conclusion
- Clients with 'Higher education' have less payment difficulties. However, this is a weak correlation
Analysis of NAME_FAMILY_STATUS¶
NAME_FAMILY_STATUS with payment difficulties
df1['NAME_FAMILY_STATUS'].value_counts().sort_values(ascending = False)
Married 14850 Single / not married 4457 Civil marriage 2961 Separated 1620 Widow 937 Name: NAME_FAMILY_STATUS, dtype: int64
NAME_FAMILY_STATUS with on-time payments
df0['NAME_FAMILY_STATUS'].value_counts().sort_values(ascending = False)
Married 181584 Single / not married 40987 Civil marriage 26814 Separated 18150 Widow 15151 Name: NAME_FAMILY_STATUS, dtype: int64
fn_uni_countplot('NAME_FAMILY_STATUS')
fn_uni_barplot('NAME_FAMILY_STATUS')
fn_uni_piechart('NAME_FAMILY_STATUS')
Observations
- Clients who are 'Married' are 59.8% with payment difficulties and 64.2% with on-timepayments
- Clients who are 'Widow' are 3.8% with payment difficulties and 5.4% with on-timepayments
- Clients who are 'Single/not married' are 18.0% with payment difficulties and 14.5% with on-timepayments
- Remaining categories don't provide any conclusive results
Conclusion
- Clients who are 'Married' OR 'Widow' do on-time payments better comparatively. However, this is a weak correlation.
- Clients who are 'Single/not married' have more difficulties with on-time payments comparatively. However, this is a weak correlation.
Correlation analysis of numerical variables¶
Plotting correlation matrix for Payment Difficulties¶
df1.select_dtypes(include=["int64","float64"]).shape
(24825, 66)
There are 66 numerical columns. Creating a correlation matrix corr to view the results better
corr_df1 = df1.select_dtypes(include=["int64","float64"]).corr()
corr_df1.head()
| SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | EXT_SOURCE_2 | EXT_SOURCE_3 | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | YEARS_BIRTH | YEARS_EMPLOYED | YEARS_REGISTRATION | YEARS_ID_PUBLISH | YEARS_LAST_PHONE_CHANGE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | 1.00 | NaN | -0.01 | -0.01 | -0.00 | -0.01 | -0.00 | 0.01 | 0.00 | -0.01 | -0.01 | 0.00 | NaN | 0.01 | 0.00 | -0.00 | 0.02 | 0.00 | -0.00 | -0.01 | -0.00 | 0.01 | -0.00 | 0.00 | 0.00 | 0.01 | 0.00 | -0.00 | 0.01 | -0.01 | -0.01 | -0.01 | -0.01 | -0.01 | -0.00 | -0.00 | -0.00 | NaN | 0.00 | 0.00 | 0.00 | 0.00 | -0.01 | NaN | -0.00 | NaN | -0.00 | 0.00 | 0.01 | -0.00 | -0.01 | -0.01 | -0.01 | 0.00 | 0.01 | -0.01 | -0.01 | -0.00 | 0.01 | -0.00 | 0.01 | 0.00 | -0.01 | -0.01 | 0.00 | -0.00 |
| TARGET | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CNT_CHILDREN | -0.01 | NaN | 1.00 | 0.00 | -0.00 | 0.03 | -0.01 | -0.03 | -0.26 | -0.19 | -0.15 | 0.03 | NaN | 0.19 | 0.05 | 0.00 | -0.02 | 0.01 | 0.89 | 0.04 | 0.04 | -0.02 | -0.02 | -0.02 | -0.01 | -0.00 | 0.05 | 0.05 | -0.01 | -0.02 | 0.03 | 0.00 | 0.03 | -0.01 | -0.01 | -0.01 | 0.06 | NaN | -0.02 | -0.13 | -0.00 | 0.03 | 0.00 | NaN | -0.01 | NaN | 0.00 | -0.00 | 0.00 | -0.00 | -0.01 | 0.00 | -0.00 | -0.00 | 0.00 | -0.00 | -0.01 | -0.01 | -0.01 | -0.02 | -0.04 | -0.26 | -0.19 | -0.15 | 0.03 | -0.01 |
| AMT_INCOME_TOTAL | -0.01 | NaN | 0.00 | 1.00 | 0.04 | 0.05 | 0.04 | 0.01 | -0.00 | -0.01 | -0.00 | 0.00 | NaN | 0.01 | -0.01 | -0.00 | -0.00 | 0.01 | 0.01 | -0.02 | -0.02 | 0.01 | 0.01 | 0.01 | 0.01 | -0.00 | -0.00 | -0.00 | 0.01 | -0.02 | -0.00 | -0.01 | -0.00 | -0.00 | 0.00 | 0.00 | 0.00 | NaN | -0.00 | -0.01 | -0.00 | 0.02 | 0.00 | NaN | -0.00 | NaN | 0.01 | 0.00 | 0.00 | 0.00 | -0.00 | -0.00 | -0.00 | -0.00 | -0.00 | 0.00 | -0.00 | 0.00 | 0.00 | -0.00 | 0.00 | -0.00 | -0.01 | 0.00 | 0.00 | 0.00 |
| AMT_CREDIT | -0.00 | NaN | -0.00 | 0.04 | 1.00 | 0.75 | 0.98 | 0.07 | 0.14 | 0.00 | 0.03 | 0.05 | NaN | -0.00 | -0.06 | 0.03 | 0.01 | 0.01 | 0.05 | -0.06 | -0.07 | 0.03 | 0.02 | 0.03 | 0.03 | -0.03 | -0.04 | -0.02 | 0.12 | 0.08 | 0.02 | -0.03 | 0.02 | -0.03 | 0.11 | 0.03 | 0.06 | NaN | -0.01 | -0.01 | -0.01 | 0.06 | 0.03 | NaN | 0.04 | NaN | 0.02 | 0.06 | 0.03 | 0.05 | -0.00 | 0.01 | 0.01 | 0.02 | -0.02 | -0.01 | 0.00 | 0.01 | 0.06 | -0.02 | -0.04 | 0.14 | 0.00 | 0.03 | 0.05 | 0.11 |
Creating a HeatMap to view the correlations above 80% and 99.99%
plt.figure(figsize = (25,25))
sns.heatmap(data = corr_df1[(corr_df1 >= 0.8) & (corr_df1 < 0.9999)], annot = True, cmap = "RdYlGn", cbar = True, fmt='.2f')
plt.show()
Getting top 10 correlations for Payment Difficulties¶
corr_df1[corr_df1 <= 0.99].unstack().sort_values(ascending = False).head(22)
AMT_GOODS_PRICE AMT_CREDIT 0.98 AMT_CREDIT AMT_GOODS_PRICE 0.98 REGION_RATING_CLIENT_W_CITY REGION_RATING_CLIENT 0.96 REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY 0.96 CNT_FAM_MEMBERS CNT_CHILDREN 0.89 CNT_CHILDREN CNT_FAM_MEMBERS 0.89 DEF_60_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE 0.87 DEF_30_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE 0.87 LIVE_REGION_NOT_WORK_REGION REG_REGION_NOT_WORK_REGION 0.85 REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION 0.85 LIVE_CITY_NOT_WORK_CITY REG_CITY_NOT_WORK_CITY 0.78 REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY 0.78 AMT_ANNUITY AMT_GOODS_PRICE 0.75 AMT_GOODS_PRICE AMT_ANNUITY 0.75 AMT_ANNUITY AMT_CREDIT 0.75 AMT_CREDIT AMT_ANNUITY 0.75 FLAG_DOCUMENT_6 YEARS_EMPLOYED 0.62 YEARS_EMPLOYED FLAG_DOCUMENT_6 0.62 DAYS_EMPLOYED FLAG_DOCUMENT_6 0.62 FLAG_DOCUMENT_6 DAYS_EMPLOYED 0.62 DAYS_EMPLOYED DAYS_BIRTH 0.58 DAYS_BIRTH DAYS_EMPLOYED 0.58 dtype: float64
As we have duplicate combinations, looking at the above and removing dups, we get top 10 correlations as:
- AMT_GOODS_PRICE AMT_CREDIT 0.98
- REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY 0.96
- CNT_FAM_MEMBERS CNT_CHILDREN 0.89
- DEF_60_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE 0.87
- REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION 0.85
- LIVE_CITY_NOT_WORK_CITY REG_CITY_NOT_WORK_CITY 0.78
- AMT_ANNUITY AMT_GOODS_PRICE 0.75
- AMT_ANNUITY AMT_CREDIT 0.75
- DAYS_EMPLOYED FLAG_DOCUMENT_6 0.62
- DAYS_BIRTH DAYS_EMPLOYED 0.58
Plotting correlation matrix for On-Time payments¶
df0.select_dtypes(include=["int64","float64"]).shape
(282686, 66)
There are 66 numerical columns. Creating a correlation matrix corr to view the results better
corr_df0 = df0.select_dtypes(include=["int64","float64"]).corr()
corr_df0.head()
| SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | EXT_SOURCE_2 | EXT_SOURCE_3 | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | YEARS_BIRTH | YEARS_EMPLOYED | YEARS_REGISTRATION | YEARS_ID_PUBLISH | YEARS_LAST_PHONE_CHANGE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | 1.00 | NaN | -0.00 | 0.00 | -0.00 | 0.00 | -0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | -0.00 | -0.00 | 0.00 | 0.00 | 0.00 | -0.00 | -0.00 | -0.00 | -0.00 | 0.00 | 0.00 | 0.00 | -0.00 | -0.00 | 0.00 | 0.00 | 0.00 | -0.00 | 0.00 | -0.00 | 0.00 | 0.00 | 0.00 | -0.00 | -0.00 | -0.00 | 0.00 | -0.00 | 0.00 | 0.00 | -0.00 | -0.00 | -0.00 | 0.00 | -0.00 | 0.00 | -0.00 | 0.00 | 0.00 | 0.00 | 0.00 | -0.00 | -0.00 | -0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| TARGET | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CNT_CHILDREN | -0.00 | NaN | 1.00 | 0.03 | 0.00 | 0.02 | -0.00 | -0.02 | -0.34 | -0.25 | -0.19 | 0.03 | 0.00 | 0.24 | 0.06 | -0.00 | -0.03 | 0.02 | 0.88 | 0.02 | 0.02 | -0.01 | -0.01 | 0.01 | 0.02 | 0.02 | 0.07 | 0.07 | -0.02 | -0.04 | 0.01 | -0.00 | 0.01 | -0.00 | 0.01 | 0.00 | 0.06 | -0.00 | -0.02 | -0.16 | -0.00 | 0.05 | -0.00 | -0.00 | -0.01 | 0.00 | 0.00 | -0.01 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | -0.00 | -0.00 | 0.00 | -0.00 | -0.01 | -0.01 | -0.04 | -0.34 | -0.25 | -0.19 | 0.03 | 0.01 |
| AMT_INCOME_TOTAL | 0.00 | NaN | 0.03 | 1.00 | 0.34 | 0.42 | 0.35 | 0.17 | -0.06 | -0.14 | -0.06 | -0.02 | 0.00 | 0.14 | -0.03 | -0.02 | 0.00 | 0.08 | 0.03 | -0.19 | -0.20 | 0.08 | 0.07 | 0.14 | 0.13 | 0.01 | 0.02 | 0.02 | 0.14 | -0.07 | -0.03 | -0.03 | -0.03 | -0.03 | 0.04 | -0.00 | -0.04 | 0.00 | 0.00 | -0.10 | 0.01 | 0.16 | 0.04 | 0.00 | 0.01 | 0.01 | 0.05 | 0.05 | 0.02 | 0.02 | 0.01 | 0.01 | 0.01 | 0.00 | -0.00 | 0.00 | 0.01 | 0.01 | 0.06 | 0.01 | 0.03 | -0.06 | -0.14 | -0.06 | -0.02 | 0.04 |
| AMT_CREDIT | -0.00 | NaN | 0.00 | 0.34 | 1.00 | 0.77 | 0.99 | 0.10 | 0.05 | -0.07 | -0.01 | 0.00 | 0.00 | 0.07 | -0.02 | 0.02 | 0.03 | 0.02 | 0.06 | -0.10 | -0.11 | 0.05 | 0.02 | 0.05 | 0.05 | -0.03 | -0.02 | 0.00 | 0.13 | 0.04 | -0.00 | -0.02 | -0.00 | -0.02 | 0.07 | 0.01 | 0.10 | 0.00 | -0.01 | -0.05 | -0.00 | 0.08 | 0.02 | -0.00 | 0.03 | 0.00 | 0.05 | 0.05 | 0.03 | 0.06 | 0.01 | 0.04 | 0.02 | 0.03 | -0.02 | -0.00 | 0.00 | -0.00 | 0.05 | 0.02 | -0.05 | 0.05 | -0.07 | -0.01 | 0.00 | 0.07 |
Creating a HeatMap to view the correlations above 80% and 99.99%
plt.figure(figsize = (25,25))
sns.heatmap(data = corr_df0[(corr_df0 >= 0.8) & (corr_df0 < 0.9999)], annot = True, cmap = "RdYlGn", cbar = True, fmt='.2f')
plt.show()
Getting top 10 correlations for On-Time Payments¶
corr_df0[corr_df0 <= 0.99].unstack().sort_values(ascending = False).head(28)
AMT_CREDIT AMT_GOODS_PRICE 0.99
AMT_GOODS_PRICE AMT_CREDIT 0.99
REGION_RATING_CLIENT_W_CITY REGION_RATING_CLIENT 0.95
REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY 0.95
CNT_CHILDREN CNT_FAM_MEMBERS 0.88
CNT_FAM_MEMBERS CNT_CHILDREN 0.88
REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION 0.86
LIVE_REGION_NOT_WORK_REGION REG_REGION_NOT_WORK_REGION 0.86
DEF_60_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE 0.86
DEF_30_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE 0.86
REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY 0.83
LIVE_CITY_NOT_WORK_CITY REG_CITY_NOT_WORK_CITY 0.83
AMT_ANNUITY AMT_GOODS_PRICE 0.78
AMT_GOODS_PRICE AMT_ANNUITY 0.78
AMT_CREDIT AMT_ANNUITY 0.77
AMT_ANNUITY AMT_CREDIT 0.77
DAYS_EMPLOYED DAYS_BIRTH 0.63
DAYS_BIRTH DAYS_EMPLOYED 0.63
YEARS_EMPLOYED 0.63
YEARS_EMPLOYED DAYS_BIRTH 0.63
DAYS_EMPLOYED YEARS_BIRTH 0.63
YEARS_BIRTH DAYS_EMPLOYED 0.63
YEARS_EMPLOYED YEARS_BIRTH 0.63
YEARS_BIRTH YEARS_EMPLOYED 0.63
FLAG_DOCUMENT_6 YEARS_EMPLOYED 0.60
YEARS_EMPLOYED FLAG_DOCUMENT_6 0.60
FLAG_DOCUMENT_6 DAYS_EMPLOYED 0.60
DAYS_EMPLOYED FLAG_DOCUMENT_6 0.60
dtype: float64
As we have duplicate combinations, looking at the above and removing dups, we get top 10 correlations as:
- AMT_GOODS_PRICE AMT_CREDIT 0.99
- REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY 0.95
- CNT_FAM_MEMBERS CNT_CHILDREN 0.88
- REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION 0.86
- DEF_30_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE 0.86
- LIVE_CITY_NOT_WORK_CITY REG_CITY_NOT_WORK_CITY 0.83
- AMT_ANNUITY AMT_GOODS_PRICE 0.78
- AMT_ANNUITY AMT_CREDIT 0.77
- DAYS_BIRTH DAYS_EMPLOYED 0.63
- DAYS_EMPLOYED FLAG_DOCUMENT_6 0.60
Comparing top 10 correlations b/w Payment difficulties and On-Time Payments¶
Observations
- The top 10 correlations for Payment difficulties and On-Time Payments are the same except minor differences in correlation percentage
- The highest correlation is for the combination of
AMT_GOODS_PRICEandAMT_CREDIT - For Payment difficulties dataset, the correlation b/w
AMT_GOODS_PRICEandAMT_CREDITis 0.98 - For On-Time Payments dataset, the correlation b/w
AMT_GOODS_PRICEandAMT_CREDITis 0.99
Univariate analysis of numerical variables¶
Analysis of AMT_CREDIT¶
Outlier identification of AMT_CREDIT with Payment difficulties¶
Calculating IQR (Inter Quartile range)
Q1 = df1['AMT_CREDIT'].quantile(0.25)
Q3 = df1['AMT_CREDIT'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
448915.5
Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers
Min_value1 = (Q1 - 1.5 * IQR)
Max_value1 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value1))
print("Max value after which outlier exist: {}".format(Max_value1))
Min value before which outlier exist: -388973.25 Max value after which outlier exist: 1406688.75
Outlier identification of AMT_CREDIT with On-Time Payments¶
Calculating IQR (Inter Quartile range)
Q1 = df0['AMT_CREDIT'].quantile(0.25)
Q3 = df0['AMT_CREDIT'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
540000.0
Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers
Min_value0 = (Q1 - 1.5 * IQR)
Max_value0 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value0))
print("Max value after which outlier exist: {}".format(Max_value0))
Min value before which outlier exist: -540000.0 Max value after which outlier exist: 1620000.0
Removing outliers and plotting distplot
plt.figure(figsize = [20,8])
sns.distplot(df1[df1['AMT_CREDIT'] <= Max_value1].AMT_CREDIT,label = 'Payment difficulties', hist=False)
sns.distplot(df0[df0['AMT_CREDIT'] <= Max_value0].AMT_CREDIT,label = 'On-Time Payments', hist=False)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.legend()
plt.show()
Observations
- For
AMT_CREDITbetween 250000 and approximately 650000, there are more clients with Payment difficulties - For
AMT_CREDIT> 750000 , there are more clients with On-Time Payments
Analysis of YEARS_BIRTH¶
Outlier identification of YEARS_BIRTH with Payment difficulties¶
Calculating IQR (Inter Quartile range)
Q1 = df1['YEARS_BIRTH'].quantile(0.25)
Q3 = df1['YEARS_BIRTH'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
18.0
Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers
Min_value1 = (Q1 - 1.5 * IQR)
Max_value1 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value1))
print("Max value after which outlier exist: {}".format(Max_value1))
Min value before which outlier exist: 4.0 Max value after which outlier exist: 76.0
Outlier identification of YEARS_BIRTH with On-Time Payments¶
Calculating IQR (Inter Quartile range)
Q1 = df0['YEARS_BIRTH'].quantile(0.25)
Q3 = df0['YEARS_BIRTH'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
20.0
Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers
Min_value0 = (Q1 - 1.5 * IQR)
Max_value0 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value0))
print("Max value after which outlier exist: {}".format(Max_value0))
Min value before which outlier exist: 4.0 Max value after which outlier exist: 84.0
Removing outliers and plotting distplot
plt.figure(figsize = [20,8])
sns.distplot(df1[df1['YEARS_BIRTH'] <= Max_value1].YEARS_BIRTH,label = 'Payment difficulties', hist=False)
sns.distplot(df0[df0['YEARS_BIRTH'] <= Max_value0].YEARS_BIRTH,label = 'On-Time Payments', hist=False)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.legend()
plt.show()
Observations
- For
YEARS_BIRTHbetween 20 and 40, there are more clients with Payment difficulties - Conversely, for
YEARS_BIRTH> 40 , there are more clients with On-Time Payments
Analysis of AMT_GOODS_PRICE ¶
Outlier identification of AMT_GOODS_PRICE with Payment difficulties¶
Calculating IQR (Inter Quartile range)
Q1 = df1['AMT_GOODS_PRICE'].quantile(0.25)
Q3 = df1['AMT_GOODS_PRICE'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
436500.0
Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers
Min_value1 = (Q1 - 1.5 * IQR)
Max_value1 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value1))
print("Max value after which outlier exist: {}".format(Max_value1))
Min value before which outlier exist: -416250.0 Max value after which outlier exist: 1329750.0
Outlier identification of AMT_GOODS_PRICE with On-Time Payments¶
Calculating IQR (Inter Quartile range)
Q1 = df0['AMT_GOODS_PRICE'].quantile(0.25)
Q3 = df0['AMT_GOODS_PRICE'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
450000.0
Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers
Min_value0 = (Q1 - 1.5 * IQR)
Max_value0 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value0))
print("Max value after which outlier exist: {}".format(Max_value0))
Min value before which outlier exist: -436500.0 Max value after which outlier exist: 1363500.0
Removing outliers and plotting distplot
plt.figure(figsize = [20,8])
sns.distplot(df1[df1['AMT_GOODS_PRICE'] <= Max_value1].AMT_GOODS_PRICE,label = 'Payment difficulties', hist=False)
sns.distplot(df0[df0['AMT_GOODS_PRICE'] <= Max_value0].AMT_GOODS_PRICE,label = 'On-Time Payments', hist=False)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.legend()
plt.show()
Observations
- For
AMT_GOODS_PRICEbetween ~250000 and ~550000, there are more clients with Payment difficulties - Otherwise there are spikes on and off but they don't show any conclusive observations
Analysis of DAYS_EMPLOYED¶
Outlier identification of DAYS_EMPLOYED with Payment difficulties¶
Calculating IQR (Inter Quartile range)
Q1 = df1['DAYS_EMPLOYED'].quantile(0.25)
Q3 = df1['DAYS_EMPLOYED'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
2603.0
Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers
Min_value1 = (Q1 - 1.5 * IQR)
Max_value1 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value1))
print("Max value after which outlier exist: {}".format(Max_value1))
Min value before which outlier exist: -3227.5 Max value after which outlier exist: 7184.5
Outlier identification of DAYS_EMPLOYED with On-Time Payments¶
Calculating IQR (Inter Quartile range)
Q1 = df0['DAYS_EMPLOYED'].quantile(0.25)
Q3 = df0['DAYS_EMPLOYED'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
5107.0
Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers
Min_value0 = (Q1 - 1.5 * IQR)
Max_value0 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value0))
print("Max value after which outlier exist: {}".format(Max_value0))
Min value before which outlier exist: -6693.5 Max value after which outlier exist: 13734.5
Removing outliers and plotting distplot
plt.figure(figsize = [20,8])
sns.distplot(df1[df1['DAYS_EMPLOYED'] <= Max_value1].DAYS_EMPLOYED,label = 'Payment difficulties', hist=False)
sns.distplot(df0[df0['DAYS_EMPLOYED'] <= Max_value0].DAYS_EMPLOYED,label = 'On-Time Payments', hist=False)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.legend()
plt.show()
Observations
- For
DAYS_EMPLOYEDless than 2000, there are more clients with Payment difficulties - Conversely, for
DAYS_EMPLOYED> 2000 , there are more clients with On-Time Payments - This means that those who are employed longer have better chances of repaying the loan
Analysis of CNT_CHILDREN ¶
Outlier identification of CNT_CHILDREN with Payment difficulties¶
Calculating IQR (Inter Quartile range)
Q1 = df1['CNT_CHILDREN'].quantile(0.25)
Q3 = df1['CNT_CHILDREN'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
1.0
Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers
Min_value1 = (Q1 - 1.5 * IQR)
Max_value1 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value1))
print("Max value after which outlier exist: {}".format(Max_value1))
Min value before which outlier exist: -1.5 Max value after which outlier exist: 2.5
Outlier identification of CNT_CHILDREN with On-Time Payments¶
Calculating IQR (Inter Quartile range)
Q1 = df0['CNT_CHILDREN'].quantile(0.25)
Q3 = df0['CNT_CHILDREN'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
1.0
Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers
Min_value0 = (Q1 - 1.5 * IQR)
Max_value0 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value0))
print("Max value after which outlier exist: {}".format(Max_value0))
Min value before which outlier exist: -1.5 Max value after which outlier exist: 2.5
Removing outliers and plotting distplot
plt.figure(figsize = [20,8])
sns.distplot(df1[df1['CNT_CHILDREN'] <= Max_value1].CNT_CHILDREN,label = 'Payment difficulties', hist=False)
sns.distplot(df0[df0['CNT_CHILDREN'] <= Max_value0].CNT_CHILDREN,label = 'On-Time Payments', hist=False)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.legend()
plt.show()
Observations
- For
CNT_CHILDREN0 (those with no children), there are lots of clients with On-Time Payments - For
CNT_CHILDRENwith 1 OR 2 (those with 1 or 2 children), there are few more clients with On-Time Payments
Analysis of AMT_INCOME_TOTAL ¶
Outlier identification of AMT_INCOME_TOTAL with Payment difficulties¶
Calculating IQR (Inter Quartile range)
Q1 = df1['AMT_INCOME_TOTAL'].quantile(0.25)
Q3 = df1['AMT_INCOME_TOTAL'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
90000.0
Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers
Min_value1 = (Q1 - 1.5 * IQR)
Max_value1 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value1))
print("Max value after which outlier exist: {}".format(Max_value1))
Min value before which outlier exist: -22500.0 Max value after which outlier exist: 337500.0
Outlier identification of AMT_INCOME_TOTAL with On-Time Payments¶
Calculating IQR (Inter Quartile range)
Q1 = df0['AMT_INCOME_TOTAL'].quantile(0.25)
Q3 = df0['AMT_INCOME_TOTAL'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
90000.0
Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers
Min_value0 = (Q1 - 1.5 * IQR)
Max_value0 = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value0))
print("Max value after which outlier exist: {}".format(Max_value0))
Min value before which outlier exist: -22500.0 Max value after which outlier exist: 337500.0
Removing outliers and plotting distplot
plt.figure(figsize = [20,8])
sns.distplot(df1[df1['AMT_INCOME_TOTAL'] <= Max_value1].AMT_INCOME_TOTAL,label = 'Payment difficulties', hist=False)
sns.distplot(df0[df0['AMT_INCOME_TOTAL'] <= Max_value0].AMT_INCOME_TOTAL,label = 'On-Time Payments', hist=False)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.legend()
plt.show()
Observations
- Based on
AMT_INCOME_TOTAL, for clients with Payment difficulties, the distribution resembles a normal distribution approximately - But for clients with On-Time Payments, there are erratic spikes in the distribution which doesn't give any valid observations
Bivariate/Multivariate analysis¶
Continuous V/S Continuous variables¶
Analysis of AMT_GOODS_PRICE V/S AMT_CREDIT¶
Outlier identification of AMT_GOODS_PRICE with Payment difficulties
max_value1_AMT_GOODS_PRICE = outlier_range(df1,'AMT_GOODS_PRICE')
max_value1_AMT_GOODS_PRICE
1329750.0
Outlier identification of AMT_CREDIT with Payment difficulties
max_value1_AMT_CREDIT = outlier_range(df1,'AMT_CREDIT')
max_value1_AMT_CREDIT
1406688.75
Outlier identification of AMT_GOODS_PRICE with On-Time Payments
max_value0_AMT_GOODS_PRICE = outlier_range(df0,'AMT_GOODS_PRICE')
max_value0_AMT_GOODS_PRICE
1363500.0
Outlier identification of AMT_CREDIT with On-Time Payments
max_value0_AMT_CREDIT = outlier_range(df0,'AMT_CREDIT')
max_value0_AMT_CREDIT
1620000.0
Plotting scatterplot for comparison with outliers removed
plt.figure(figsize = [20,8])
plt.subplot(1,2,1)
plt.title('Payment difficulties')
sns.scatterplot(x = df1[df1['AMT_GOODS_PRICE'] < max_value1_AMT_GOODS_PRICE].AMT_GOODS_PRICE,
y = df1[df1['AMT_CREDIT'] < max_value1_AMT_CREDIT].AMT_CREDIT, data = df1)
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')
plt.subplot(1,2,2)
plt.title('On-Time Payments')
sns.scatterplot(x = df0[df0['AMT_GOODS_PRICE'] < max_value0_AMT_GOODS_PRICE].AMT_GOODS_PRICE,
y = df0[df0['AMT_CREDIT'] < max_value0_AMT_CREDIT].AMT_CREDIT, data = df0)
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout(pad = 4)
plt.show()
Observations
AMT_GOODS_PRICEandAMT_CREDIThave strong positive correlation. This means that as Goods price increases, so does Credit Amount
Analysis of AMT_ANNUITY V/S AMT_CREDIT¶
Outlier identification of AMT_ANNUITY with Payment difficulties
max_value1_AMT_ANNUITY = outlier_range(df1,'AMT_ANNUITY')
max_value1_AMT_ANNUITY
56398.5
Outlier identification of AMT_CREDIT with Payment difficulties
max_value1_AMT_CREDIT = outlier_range(df1,'AMT_CREDIT')
max_value1_AMT_CREDIT
1406688.75
Outlier identification of AMT_ANNUITY with On-Time Payments
max_value0_AMT_ANNUITY = outlier_range(df0,'AMT_ANNUITY')
max_value0_AMT_ANNUITY
62187.75
Outlier identification of AMT_CREDIT with On-Time Payments
max_value0_AMT_CREDIT = outlier_range(df0,'AMT_CREDIT')
max_value0_AMT_CREDIT
1620000.0
Plotting scatterplot for comparison with outliers removed
plt.figure(figsize = [20,8])
plt.subplot(1,2,1)
plt.title('Payment difficulties')
sns.scatterplot(x = df1[df1['AMT_ANNUITY'] < max_value1_AMT_ANNUITY].AMT_ANNUITY,
y = df1[df1['AMT_CREDIT'] < max_value1_AMT_CREDIT].AMT_CREDIT, data = df1)
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')
plt.subplot(1,2,2)
plt.title('On-Time Payments')
sns.scatterplot(x = df0[df0['AMT_ANNUITY'] < max_value0_AMT_ANNUITY].AMT_ANNUITY,
y = df0[df0['AMT_CREDIT'] < max_value0_AMT_CREDIT].AMT_CREDIT, data = df0)
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout(pad = 4)
plt.show()
Observations
AMT_ANNUITYandAMT_CREDIThave strong positive correlation. This means that as Annuity Amount increases, so does Credit Amount
Analysis of DAYS_EMPLOYED V/S AMT_INCOME_TOTAL¶
Outlier identification of DAYS_EMPLOYED with Payment difficulties
max_value1_DAYS_EMPLOYED = outlier_range(df1,'DAYS_EMPLOYED')
max_value1_DAYS_EMPLOYED
7184.5
Outlier identification of AMT_INCOME_TOTAL with Payment difficulties
max_value1_AMT_INCOME_TOTAL = outlier_range(df1,'AMT_INCOME_TOTAL')
max_value1_AMT_INCOME_TOTAL
337500.0
Outlier identification of DAYS_EMPLOYED with On-Time Payments
max_value0_DAYS_EMPLOYED = outlier_range(df0,'DAYS_EMPLOYED')
max_value0_DAYS_EMPLOYED
13734.5
Outlier identification of AMT_INCOME_TOTAL with On-Time Payments
max_value0_AMT_INCOME_TOTAL = outlier_range(df0,'AMT_INCOME_TOTAL')
max_value0_AMT_INCOME_TOTAL
337500.0
Plotting scatterplot for comparison with outliers removed
plt.figure(figsize = [20,8])
plt.subplot(1,2,1)
plt.title('Payment difficulties')
sns.scatterplot(x = df1[df1['DAYS_EMPLOYED'] < max_value1_DAYS_EMPLOYED].DAYS_EMPLOYED,
y = df1[df1['AMT_INCOME_TOTAL'] < max_value1_AMT_INCOME_TOTAL].AMT_INCOME_TOTAL, data = df1)
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')
plt.subplot(1,2,2)
plt.title('On-Time Payments')
sns.scatterplot(x = df0[df0['DAYS_EMPLOYED'] < max_value0_DAYS_EMPLOYED].DAYS_EMPLOYED,
y = df0[df0['AMT_INCOME_TOTAL'] < max_value0_AMT_INCOME_TOTAL].AMT_INCOME_TOTAL, data = df0)
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout(pad = 4)
plt.show()
Observations
- Clients who are employed for a long time (>7000) days are making their payments on-time but these category of clients do not exist in Payments difficulties group
- Even looking at Payment difficulties group, clients with more than 4000 days of employment are sparse
Analysis of AMT_CREDIT V/S DAYS_BIRTH¶
Outlier identification of AMT_CREDIT with Payment difficulties
max_value1_AMT_CREDIT = outlier_range(df1,'AMT_CREDIT')
max_value1_AMT_CREDIT
1406688.75
Outlier identification of DAYS_BIRTH with Payment difficulties
max_value1_DAYS_BIRTH = outlier_range(df1,'DAYS_BIRTH')
max_value1_DAYS_BIRTH
27998.5
Outlier identification of AMT_CREDIT with On-Time Payments
max_value0_AMT_CREDIT = outlier_range(df0,'AMT_CREDIT')
max_value0_AMT_CREDIT
1620000.0
Outlier identification of DAYS_BIRTH with On-Time Payments
max_value0_DAYS_BIRTH = outlier_range(df0,'DAYS_BIRTH')
max_value0_DAYS_BIRTH
30680.375
Plotting scatterplot for comparison with outliers removed
plt.figure(figsize = [20,8])
plt.subplot(1,2,1)
plt.title('Payment difficulties')
sns.scatterplot(x = df1[df1['AMT_CREDIT'] < max_value1_AMT_CREDIT].AMT_CREDIT,
y = df1[df1['DAYS_BIRTH'] < max_value1_DAYS_BIRTH].DAYS_BIRTH, data = df1)
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')
plt.subplot(1,2,2)
plt.title('On-Time Payments')
sns.scatterplot(x = df0[df0['AMT_CREDIT'] < max_value0_AMT_CREDIT].AMT_CREDIT,
y = df0[df0['DAYS_BIRTH'] < max_value0_DAYS_BIRTH].DAYS_BIRTH, data = df0)
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout(pad = 4)
plt.show()
Observations
- There is no observable correlation between Days of Birth and Amount of Credit
Analysis of AMT_ANNUITY V/S AMT_GOODS_PRICE¶
Outlier identification of AMT_ANNUITY with Payment difficulties
max_value1_AMT_ANNUITY = outlier_range(df1,'AMT_ANNUITY')
max_value1_AMT_ANNUITY
56398.5
Outlier identification of AMT_GOODS_PRICE with Payment difficulties
max_value1_AMT_GOODS_PRICE = outlier_range(df1,'AMT_GOODS_PRICE')
max_value1_AMT_GOODS_PRICE
1329750.0
Outlier identification of AMT_ANNUITY with On-Time Payments
max_value0_AMT_ANNUITY = outlier_range(df0,'AMT_ANNUITY')
max_value0_AMT_ANNUITY
62187.75
Outlier identification of AMT_GOODS_PRICE with On-Time Payments
max_value0_AMT_GOODS_PRICE = outlier_range(df0,'AMT_GOODS_PRICE')
max_value0_AMT_GOODS_PRICE
1363500.0
Plotting scatterplot for comparison with outliers removed
plt.figure(figsize = [20,8])
plt.subplot(1,2,1)
plt.title('Payment difficulties')
sns.scatterplot(x = df1[df1['AMT_ANNUITY'] < max_value1_AMT_ANNUITY].AMT_ANNUITY,
y = df1[df1['AMT_GOODS_PRICE'] < max_value1_AMT_GOODS_PRICE].AMT_GOODS_PRICE, data = df1)
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')
plt.subplot(1,2,2)
plt.title('On-Time Payments')
sns.scatterplot(x = df0[df0['AMT_ANNUITY'] < max_value0_AMT_ANNUITY].AMT_ANNUITY,
y = df0[df0['AMT_GOODS_PRICE'] < max_value0_AMT_GOODS_PRICE].AMT_GOODS_PRICE, data = df0)
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout(pad = 4)
plt.show()
Observations
AMT_ANNUITYandAMT_GOODS_PRICEhave strong positive correlation. This means that as Annuity increases, so does Goods Price
Continuous V/S Categorical variables¶
Analysis of NAME_EDUCATION_TYPE V/S AMT_CREDIT V/S CODE_GENDER¶
Outlier identification of AMT_CREDIT with Payment difficulties
max_value1_AMT_CREDIT = outlier_range(df1,'AMT_CREDIT')
max_value1_AMT_CREDIT
1406688.75
Outlier identification of AMT_CREDIT with On-Time Payments
max_value0_AMT_CREDIT = outlier_range(df0,'AMT_CREDIT')
max_value0_AMT_CREDIT
1620000.0
Client with Payment difficulties
df1.groupby(by = ['NAME_EDUCATION_TYPE','CODE_GENDER']).AMT_CREDIT.describe().head()
| count | mean | std | min | 25% | 50% | 75% | max | ||
|---|---|---|---|---|---|---|---|---|---|
| NAME_EDUCATION_TYPE | CODE_GENDER | ||||||||
| Academic degree | F | 3.00 | 950245.50 | 504711.38 | 544491.00 | 667660.50 | 790830.00 | 1153122.75 | 1515415.50 |
| Higher education | F | 2438.00 | 648500.58 | 410415.70 | 47970.00 | 323595.00 | 545040.00 | 877500.00 | 2687355.00 |
| M | 1571.00 | 642070.94 | 409695.93 | 45000.00 | 329202.00 | 545040.00 | 832797.00 | 2695500.00 | |
| Incomplete higher | F | 504.00 | 520150.35 | 352227.49 | 91692.00 | 269550.00 | 450000.00 | 675000.00 | 2695500.00 |
| M | 368.00 | 554723.82 | 347843.71 | 74628.00 | 284400.00 | 497520.00 | 699976.12 | 2013840.00 |
Client with On-Time Payments
df0.groupby(by = ['NAME_EDUCATION_TYPE','CODE_GENDER']).AMT_CREDIT.describe().head()
| count | mean | std | min | 25% | 50% | 75% | max | ||
|---|---|---|---|---|---|---|---|---|---|
| NAME_EDUCATION_TYPE | CODE_GENDER | ||||||||
| Academic degree | F | 97.00 | 708770.74 | 463052.42 | 130824.00 | 354469.50 | 607500.00 | 1006920.00 | 2463840.00 |
| M | 64.00 | 735235.38 | 497739.97 | 74182.50 | 309519.00 | 610056.00 | 1036890.00 | 1984500.00 | |
| Higher education | F | 47852.00 | 682485.00 | 456642.78 | 45000.00 | 302206.50 | 568057.50 | 916470.00 | 4050000.00 |
| M | 23002.00 | 713144.54 | 486073.66 | 45000.00 | 315000.00 | 592560.00 | 973503.00 | 4050000.00 | |
| Incomplete higher | F | 5780.00 | 555340.53 | 385980.60 | 45000.00 | 253737.00 | 454500.00 | 760225.50 | 2517300.00 |
fn_bi_boxplot('NAME_EDUCATION_TYPE','AMT_CREDIT',max_value1_AMT_CREDIT,max_value0_AMT_CREDIT,'CODE_GENDER')
Observations
- Clients with
Academic Degreehave a wide range of credits for On-Time Payments whereas the range is much lower for ones with Payment difficulties - Looking at summary statistics, Clients with
Academic Degreeand Payment difficulties take mean and median credit at a much higher range than On-Time Payment clients Maleclients withAcademic Degreealways pay the loan on-time
Analysis of NAME_FAMILY_STATUS V/S AMT_INCOME_TOTAL V/S CODE_GENDER¶
Outlier identification of AMT_INCOME_TOTAL with Payment difficulties
max_value1_AMT_INCOME_TOTAL = outlier_range(df1,'AMT_INCOME_TOTAL')
max_value1_AMT_INCOME_TOTAL
337500.0
Outlier identification of AMT_INCOME_TOTAL with On-Time Payments
max_value0_AMT_INCOME_TOTAL = outlier_range(df0,'AMT_INCOME_TOTAL')
max_value0_AMT_INCOME_TOTAL
337500.0
Client with Payment difficulties
df1.groupby(by = ['NAME_FAMILY_STATUS','CODE_GENDER']).AMT_INCOME_TOTAL.describe().head()
| count | mean | std | min | 25% | 50% | 75% | max | ||
|---|---|---|---|---|---|---|---|---|---|
| NAME_FAMILY_STATUS | CODE_GENDER | ||||||||
| Civil marriage | F | 1753.00 | 153620.72 | 101045.86 | 25650.00 | 108000.00 | 135000.00 | 180000.00 | 3150000.00 |
| M | 1208.00 | 173949.99 | 92449.17 | 36000.00 | 117000.00 | 157500.00 | 202500.00 | 1350000.00 | |
| Married | F | 8324.00 | 159907.85 | 1283283.40 | 27000.00 | 90000.00 | 135000.00 | 180000.00 | 117000000.00 |
| M | 6526.00 | 182233.06 | 94787.77 | 31500.00 | 126000.00 | 157500.00 | 225000.00 | 1890000.00 | |
| Separated | F | 1045.00 | 161159.55 | 82633.33 | 36000.00 | 112500.00 | 135000.00 | 202500.00 | 832500.00 |
Client with On-Time Payments
df0.groupby(by = ['NAME_FAMILY_STATUS','CODE_GENDER']).AMT_INCOME_TOTAL.describe().head()
| count | mean | std | min | 25% | 50% | 75% | max | ||
|---|---|---|---|---|---|---|---|---|---|
| NAME_FAMILY_STATUS | CODE_GENDER | ||||||||
| Civil marriage | F | 19017.00 | 162086.05 | 106811.49 | 27000.00 | 112500.00 | 135000.00 | 202500.00 | 9000000.00 |
| M | 7797.00 | 184126.29 | 106559.49 | 27000.00 | 121500.00 | 157500.00 | 225000.00 | 3825000.00 | |
| Married | F | 114125.00 | 154289.53 | 88593.43 | 26100.00 | 94500.00 | 135000.00 | 180000.00 | 3600000.00 |
| M | 67459.00 | 199592.83 | 147599.76 | 27000.00 | 135000.00 | 180000.00 | 225000.00 | 18000090.00 | |
| Separated | F | 14416.00 | 166374.21 | 94782.18 | 27000.00 | 112500.00 | 139500.00 | 202500.00 | 2250000.00 |
fn_bi_boxplot('NAME_FAMILY_STATUS','AMT_INCOME_TOTAL',max_value1_AMT_INCOME_TOTAL,max_value0_AMT_INCOME_TOTAL,'CODE_GENDER')
Observations
Marriedclients have a slightly higher mean/median income with On-Time Payments than Payment difficulties category
Analysis of YEARS_BIRTH_CATEGORY V/S AMT_INCOME_TOTAL V/S NAME_HOUSING_TYPE¶
Outlier identification of AMT_INCOME_TOTAL with Payment difficulties
max_value1_AMT_INCOME_TOTAL = outlier_range(df1,'AMT_INCOME_TOTAL')
max_value1_AMT_INCOME_TOTAL
337500.0
Outlier identification of AMT_INCOME_TOTAL with On-Time Payments
max_value0_AMT_INCOME_TOTAL = outlier_range(df0,'AMT_INCOME_TOTAL')
max_value0_AMT_INCOME_TOTAL
337500.0
Client with Payment difficulties
df1.groupby(by = ['YEARS_BIRTH_CATEGORY','NAME_HOUSING_TYPE']).AMT_INCOME_TOTAL.describe().head()
| count | mean | std | min | 25% | 50% | 75% | max | ||
|---|---|---|---|---|---|---|---|---|---|
| YEARS_BIRTH_CATEGORY | NAME_HOUSING_TYPE | ||||||||
| 20-30 | Co-op apartment | 40.00 | 148628.70 | 59516.24 | 67500.00 | 110250.00 | 135000.00 | 185625.00 | 360000.00 |
| House / apartment | 4201.00 | 151897.22 | 70874.65 | 31500.00 | 112500.00 | 135000.00 | 180000.00 | 810000.00 | |
| Municipal apartment | 153.00 | 153608.82 | 78282.79 | 45000.00 | 108000.00 | 135000.00 | 180000.00 | 675000.00 | |
| Office apartment | 40.00 | 147375.00 | 50817.26 | 67500.00 | 112500.00 | 135000.00 | 180000.00 | 270000.00 | |
| Rented apartment | 229.00 | 144210.10 | 59998.66 | 45000.00 | 99000.00 | 135000.00 | 180000.00 | 405000.00 |
Client with On-Time Payments
df0.groupby(by = ['YEARS_BIRTH_CATEGORY','NAME_HOUSING_TYPE']).AMT_INCOME_TOTAL.describe().head()
| count | mean | std | min | 25% | 50% | 75% | max | ||
|---|---|---|---|---|---|---|---|---|---|
| YEARS_BIRTH_CATEGORY | NAME_HOUSING_TYPE | ||||||||
| 20-30 | Co-op apartment | 298.00 | 161153.15 | 76171.26 | 45000.00 | 112500.00 | 139500.00 | 202500.00 | 720000.00 |
| House / apartment | 33165.00 | 163944.39 | 92313.18 | 27000.00 | 112500.00 | 141264.00 | 202500.00 | 4500000.00 | |
| Municipal apartment | 1099.00 | 166235.08 | 88698.47 | 45000.00 | 112500.00 | 144000.00 | 202500.00 | 1417500.00 | |
| Office apartment | 455.00 | 177960.18 | 115765.13 | 36000.00 | 112500.00 | 157500.00 | 225000.00 | 1800000.00 | |
| Rented apartment | 1535.00 | 153028.06 | 70748.99 | 29250.00 | 109125.00 | 135000.00 | 180000.00 | 742500.00 |
fn_bi_boxplot('YEARS_BIRTH_CATEGORY','AMT_INCOME_TOTAL',max_value1_AMT_INCOME_TOTAL,max_value0_AMT_INCOME_TOTAL,'NAME_HOUSING_TYPE')
Observations
- Clients with age
60-70and living inCo-op apartmenthave very high income range in Payment difficulties category than On-Time Payments - Clients with age
20-30and living inOffice apartmenthave very higher income median in On-Time Payments compared to Payment difficulties category
Analysis of FLAG_OWN_CAR V/S AMT_ANNUITY V/S CODE_GENDER¶
Outlier identification of AMT_ANNUITY with Payment difficulties
max_value1_AMT_ANNUITY = outlier_range(df1,'AMT_ANNUITY')
max_value1_AMT_ANNUITY
56398.5
Outlier identification of AMT_ANNUITY with On-Time Payments
max_value0_AMT_ANNUITY = outlier_range(df0,'AMT_ANNUITY')
max_value0_AMT_ANNUITY
62187.75
Client with Payment difficulties
df1.groupby(by = ['FLAG_OWN_CAR','CODE_GENDER']).AMT_ANNUITY.describe().head()
| count | mean | std | min | 25% | 50% | 75% | max | ||
|---|---|---|---|---|---|---|---|---|---|
| FLAG_OWN_CAR | CODE_GENDER | ||||||||
| N | F | 11567.00 | 24910.53 | 11855.56 | 2722.50 | 16350.75 | 23715.00 | 31261.50 | 149211.00 |
| M | 5682.00 | 26258.84 | 12075.78 | 3172.50 | 17517.38 | 25202.25 | 32352.75 | 127507.50 | |
| Y | F | 2603.00 | 28577.91 | 13256.17 | 3559.50 | 18969.75 | 26797.50 | 35961.75 | 128178.00 |
| M | 4973.00 | 29293.82 | 13130.94 | 4374.00 | 20007.00 | 27558.00 | 36459.00 | 115803.00 |
Client with On-Time Payments
df0.groupby(by = ['FLAG_OWN_CAR','CODE_GENDER']).AMT_ANNUITY.describe().head()
| count | mean | std | min | 25% | 50% | 75% | max | ||
|---|---|---|---|---|---|---|---|---|---|
| FLAG_OWN_CAR | CODE_GENDER | ||||||||
| N | F | 145917.00 | 25351.26 | 13579.03 | 1615.50 | 15448.50 | 23283.00 | 32125.50 | 230161.50 |
| M | 39749.00 | 26799.80 | 14295.32 | 1980.00 | 16294.50 | 24750.00 | 34114.50 | 225000.00 | |
| Y | F | 42357.00 | 29830.22 | 15705.98 | 1993.50 | 18409.50 | 27085.50 | 38133.00 | 225000.00 |
| M | 54651.00 | 30200.48 | 15988.66 | 2317.50 | 19125.00 | 27630.00 | 38322.00 | 258025.50 |
fn_bi_boxplot('FLAG_OWN_CAR','AMT_ANNUITY',max_value1_AMT_ANNUITY,max_value0_AMT_ANNUITY,'CODE_GENDER')
Observations
- We don't find any significant observations
Analysis of NAME_INCOME_TYPE V/S AMT_GOODS_PRICE V/S CODE_GENDER¶
Outlier identification of AMT_GOODS_PRICE with Payment difficulties
max_value1_AMT_GOODS_PRICE = outlier_range(df1,'AMT_GOODS_PRICE')
max_value1_AMT_GOODS_PRICE
1329750.0
Outlier identification of AMT_GOODS_PRICE with On-Time Payments
max_value0_AMT_GOODS_PRICE = outlier_range(df0,'AMT_GOODS_PRICE')
max_value0_AMT_GOODS_PRICE
1363500.0
Client with Payment difficulties
df1.groupby(by = ['NAME_INCOME_TYPE','CODE_GENDER']).AMT_GOODS_PRICE.describe().head()
| count | mean | std | min | 25% | 50% | 75% | max | ||
|---|---|---|---|---|---|---|---|---|---|
| NAME_INCOME_TYPE | CODE_GENDER | ||||||||
| Commercial associate | F | 2966.00 | 536362.21 | 353710.65 | 45000.00 | 270000.00 | 450000.00 | 675000.00 | 3600000.00 |
| M | 2391.00 | 526387.31 | 335859.94 | 45000.00 | 270000.00 | 450000.00 | 675000.00 | 2961000.00 | |
| Maternity leave | F | 2.00 | 929250.00 | 715945.62 | 423000.00 | 676125.00 | 929250.00 | 1182375.00 | 1435500.00 |
| Pensioner | F | 2241.00 | 495016.26 | 317042.45 | 45000.00 | 229500.00 | 450000.00 | 675000.00 | 2173500.00 |
| M | 739.00 | 486543.08 | 321104.43 | 45000.00 | 229500.00 | 450000.00 | 675000.00 | 1800000.00 |
Client with On-Time Payments
df0.groupby(by = ['NAME_INCOME_TYPE','CODE_GENDER']).AMT_GOODS_PRICE.describe().head()
| count | mean | std | min | 25% | 50% | 75% | max | ||
|---|---|---|---|---|---|---|---|---|---|
| NAME_INCOME_TYPE | CODE_GENDER | ||||||||
| Businessman | F | 3.00 | 1125000.00 | 1031079.53 | 225000.00 | 562500.00 | 900000.00 | 1575000.00 | 2250000.00 |
| M | 7.00 | 1272857.14 | 755104.06 | 495000.00 | 607500.00 | 1350000.00 | 1800000.00 | 2250000.00 | |
| Commercial associate | F | 41522.00 | 602242.63 | 400824.95 | 45000.00 | 270000.00 | 472500.00 | 837000.00 | 4050000.00 |
| M | 24691.00 | 620220.21 | 422445.53 | 45000.00 | 270000.00 | 472500.00 | 900000.00 | 4050000.00 | |
| Maternity leave | F | 2.00 | 562500.00 | 350017.86 | 315000.00 | 438750.00 | 562500.00 | 686250.00 | 810000.00 |
fn_bi_boxplot('NAME_INCOME_TYPE','AMT_GOODS_PRICE',max_value1_AMT_GOODS_PRICE,max_value0_AMT_GOODS_PRICE,'CODE_GENDER')
Observations
- Clients who are
UnemployedandMalehave a very high price of goods in On-Time Payments than Payment difficulties - Clients who are
Studentand eitherMaleORFemaledo their payments On-Time. They are completely missing from Payment difficulties category.Studentseems to be an attractive category to give loans to. - Clients who are
Businessmanand eitherMaleORFemaledo their payments On-Time. They are completely missing from Payment difficulties category.Businessmanseems to be an attractive category to give loans to.
Analysis of NAME_INCOME_TYPE V/S AMT_INCOME_TOTAL V/S CODE_GENDER¶
Outlier identification of AMT_INCOME_TOTAL with Payment difficulties
max_value1_AMT_INCOME_TOTAL = outlier_range(df1,'AMT_INCOME_TOTAL')
max_value1_AMT_INCOME_TOTAL
337500.0
Outlier identification of AMT_INCOME_TOTAL with On-Time Payments
max_value0_AMT_INCOME_TOTAL = outlier_range(df0,'AMT_INCOME_TOTAL')
max_value0_AMT_INCOME_TOTAL
337500.0
Client with Payment difficulties
df1.groupby(by = ['NAME_INCOME_TYPE','CODE_GENDER']).AMT_INCOME_TOTAL.describe().head()
| count | mean | std | min | 25% | 50% | 75% | max | ||
|---|---|---|---|---|---|---|---|---|---|
| NAME_INCOME_TYPE | CODE_GENDER | ||||||||
| Commercial associate | F | 2968.00 | 172962.54 | 94922.97 | 36000.00 | 112500.00 | 157500.00 | 202500.00 | 1575000.00 |
| M | 2392.00 | 207145.51 | 120281.01 | 45000.00 | 135000.00 | 180000.00 | 225000.00 | 1890000.00 | |
| Maternity leave | F | 2.00 | 58500.00 | 12727.92 | 49500.00 | 54000.00 | 58500.00 | 63000.00 | 67500.00 |
| Pensioner | F | 2243.00 | 133275.59 | 71580.63 | 25650.00 | 90000.00 | 117000.00 | 157500.00 | 1260000.00 |
| M | 739.00 | 142481.26 | 76941.74 | 31500.00 | 90000.00 | 135000.00 | 180000.00 | 585000.00 |
Client with On-Time Payments
df0.groupby(by = ['NAME_INCOME_TYPE','CODE_GENDER']).AMT_INCOME_TOTAL.describe().head()
| count | mean | std | min | 25% | 50% | 75% | max | ||
|---|---|---|---|---|---|---|---|---|---|
| NAME_INCOME_TYPE | CODE_GENDER | ||||||||
| Businessman | F | 3.00 | 555000.00 | 337749.91 | 225000.00 | 382500.00 | 540000.00 | 720000.00 | 900000.00 |
| M | 7.00 | 694285.71 | 738446.73 | 180000.00 | 202500.00 | 450000.00 | 787500.00 | 2250000.00 | |
| Commercial associate | F | 41552.00 | 187223.56 | 108354.86 | 26550.00 | 117000.00 | 157500.00 | 225000.00 | 3600000.00 |
| M | 24705.00 | 232612.57 | 199602.65 | 36000.00 | 157500.00 | 202500.00 | 270000.00 | 18000090.00 | |
| Maternity leave | F | 2.00 | 112500.00 | 31819.81 | 90000.00 | 101250.00 | 112500.00 | 123750.00 | 135000.00 |
fn_bi_boxplot('NAME_INCOME_TYPE','AMT_INCOME_TOTAL',max_value1_AMT_INCOME_TOTAL,max_value0_AMT_INCOME_TOTAL,'CODE_GENDER')
Observations
- Clients who are
UnemployedandMalehave a very high income in On-Time Payments than Payment difficulties - Clients who are
Studentand eitherMaleORFemaledo their payments On-Time. They are completely missing from Payment difficulties category.Studentseems to be an attractive category to give loans to. - Clients who are
Businessmanand eitherMaleORFemaledo their payments On-Time. They are completely missing from Payment difficulties category.Businessmanseems to be an attractive category to give loans to. - Clients who are in
Maternity LeaveandFemalehave a very high income in On-Time Payments than Payment difficulties
Analysis of OCCUPATION_TYPE V/S AMT_INCOME_TOTAL V/S CODE_GENDER¶
Outlier identification of AMT_INCOME_TOTAL with Payment difficulties
max_value1_AMT_INCOME_TOTAL = outlier_range(df1,'AMT_INCOME_TOTAL')
max_value1_AMT_INCOME_TOTAL
337500.0
Outlier identification of AMT_INCOME_TOTAL with On-Time Payments
max_value0_AMT_INCOME_TOTAL = outlier_range(df0,'AMT_INCOME_TOTAL')
max_value0_AMT_INCOME_TOTAL
337500.0
Client with Payment difficulties
df1.groupby(by = ['OCCUPATION_TYPE','CODE_GENDER']).AMT_INCOME_TOTAL.describe().head()
| count | mean | std | min | 25% | 50% | 75% | max | ||
|---|---|---|---|---|---|---|---|---|---|
| OCCUPATION_TYPE | CODE_GENDER | ||||||||
| Accountants | F | 457.00 | 174840.64 | 79021.04 | 54000.00 | 112500.00 | 157500.00 | 202500.00 | 540000.00 |
| M | 17.00 | 202367.65 | 98623.66 | 81000.00 | 121500.00 | 180000.00 | 252000.00 | 438750.00 | |
| Cleaning staff | F | 410.00 | 128907.99 | 70328.23 | 31500.00 | 90000.00 | 112500.00 | 157500.00 | 720000.00 |
| M | 37.00 | 119493.24 | 45786.21 | 45000.00 | 90000.00 | 112500.00 | 135000.00 | 247500.00 | |
| Cooking staff | F | 555.00 | 129298.38 | 55531.88 | 32400.00 | 90000.00 | 112500.00 | 157500.00 | 405000.00 |
Client with On-Time Payments
df0.groupby(by = ['OCCUPATION_TYPE','CODE_GENDER']).AMT_INCOME_TOTAL.describe().head()
| count | mean | std | min | 25% | 50% | 75% | max | ||
|---|---|---|---|---|---|---|---|---|---|
| OCCUPATION_TYPE | CODE_GENDER | ||||||||
| Accountants | F | 9072.00 | 194324.51 | 109135.61 | 27000.00 | 130500.00 | 175500.00 | 225000.00 | 2214117.00 |
| M | 267.00 | 236490.89 | 131795.33 | 54000.00 | 141750.00 | 216000.00 | 270000.00 | 1035000.00 | |
| Cleaning staff | F | 3904.00 | 130602.05 | 63495.05 | 25650.00 | 90000.00 | 112500.00 | 157500.00 | 1080000.00 |
| M | 302.00 | 137172.52 | 64373.09 | 31500.00 | 90000.00 | 132750.00 | 169875.00 | 450000.00 | |
| Cooking staff | F | 4830.00 | 134460.40 | 60213.56 | 27000.00 | 90000.00 | 121500.00 | 157500.00 | 540000.00 |
fn_bi_boxplot('OCCUPATION_TYPE','AMT_INCOME_TOTAL',max_value1_AMT_INCOME_TOTAL,max_value0_AMT_INCOME_TOTAL,'CODE_GENDER')
Observations
- Clients who are
Waiters/barment staffandfemalehave less median income in On-Time Payments than Payment difficulties - Clients who are
Cleaning staffandfemalehave more median income in On-Time Payments than Payment difficulties - Clients who are
HR StaffandMalehave more median income in Payment difficulties than On-Time Payments - Clients who are
ManagersandMalehave more median income in On-Time Payments than Payment difficulties
Categorical V/S Categorical variables¶
Analysis of NAME_INCOME_TYPE V/S CODE_GENDER¶
fn_bi_countplot('NAME_INCOME_TYPE','CODE_GENDER')
Observations
- Clients who are
WorkingandMalehave more Payment difficulties compared to On-Time Payments - Clients who are
PensionerandFemalehave more Payment difficulties compared to On-Time Payments - Clients who are
BusinessmanandStudentsdo their payments On-Time though their record count is low
Analysis of NAME_EDUCATION_TYPE V/S CODE_GENDER¶
fn_bi_countplot('NAME_EDUCATION_TYPE','CODE_GENDER')
Observations
- Clients who have
Secondary/Secondary specialeducation andMalehave more Payment difficulties compared to On-Time Payments - Clients who have
Higher educationandFemalehave more On-Time Payments compared to Payment difficulties
Analysis of NAME_FAMILY_STATUS V/S OCCUPATION_TYPE¶
fn_bi_countplot('OCCUPATION_TYPE','NAME_FAMILY_STATUS')
Observations
- Clients who are
Single/not married,Married&Civil marriageand areWaiters/barmen staffhave more Payment difficulties compared to On-Time Payments - Clients who are
Single/not married&Marriedand areLaborershave more Payment difficulties compared to On-Time Payments - Clients who are
Marriedand areDrivershave more Payment difficulties compared to On-Time Payments MarriedandAccountantshave better On-Time Payments
Analysis of ORGANIZATION_TYPE V/S FLAG_OWN_CAR¶
fn_bi_countplot('ORGANIZATION_TYPE','FLAG_OWN_CAR')
Observations
- Clients who are
Self-employedand don't ownCarhave more Payment difficulties compared to On-Time Payments
Analysis of OCCUPATION_TYPE V/S NAME_CONTRACT_TYPE¶
fn_bi_countplot('OCCUPATION_TYPE','NAME_CONTRACT_TYPE')
Observations
- Clients who are
Sales staff,Laborers,Driversand haveCash loanshave more Payment difficulties compared to On-Time Payments
Previous Application Data¶
Importing the dataset¶
df_prev = pd.read_csv("Previous_application.csv")
# Checking few records from the dataframe
df_prev.head()
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_TYPE_SUITE | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.43 | 17145.00 | 17145.00 | 0.00 | 17145.00 | SATURDAY | 15 | Y | 1 | 0.00 | 0.18 | 0.87 | XAP | Approved | -73 | Cash through the bank | XAP | NaN | Repeater | Mobile | POS | XNA | Country-wide | 35 | Connectivity | 12.00 | middle | POS mobile with interest | 365243.00 | -42.00 | 300.00 | -42.00 | -37.00 | 0.00 |
| 1 | 2802425 | 108129 | Cash loans | 25188.62 | 607500.00 | 679671.00 | NaN | 607500.00 | THURSDAY | 11 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -164 | XNA | XAP | Unaccompanied | Repeater | XNA | Cash | x-sell | Contact center | -1 | XNA | 36.00 | low_action | Cash X-Sell: low | 365243.00 | -134.00 | 916.00 | 365243.00 | 365243.00 | 1.00 |
| 2 | 2523466 | 122040 | Cash loans | 15060.74 | 112500.00 | 136444.50 | NaN | 112500.00 | TUESDAY | 11 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -301 | Cash through the bank | XAP | Spouse, partner | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.00 | high | Cash X-Sell: high | 365243.00 | -271.00 | 59.00 | 365243.00 | 365243.00 | 1.00 |
| 3 | 2819243 | 176158 | Cash loans | 47041.33 | 450000.00 | 470790.00 | NaN | 450000.00 | MONDAY | 7 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -512 | Cash through the bank | XAP | NaN | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.00 | middle | Cash X-Sell: middle | 365243.00 | -482.00 | -152.00 | -182.00 | -177.00 | 1.00 |
| 4 | 1784265 | 202054 | Cash loans | 31924.40 | 337500.00 | 404055.00 | NaN | 337500.00 | THURSDAY | 9 | Y | 1 | NaN | NaN | NaN | Repairs | Refused | -781 | Cash through the bank | HC | NaN | Repeater | XNA | Cash | walk-in | Credit and cash offices | -1 | XNA | 24.00 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
Check structure of the data¶
df_prev.info(verbose = True,null_counts = True)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1670214 entries, 0 to 1670213 Data columns (total 37 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SK_ID_PREV 1670214 non-null int64 1 SK_ID_CURR 1670214 non-null int64 2 NAME_CONTRACT_TYPE 1670214 non-null object 3 AMT_ANNUITY 1297979 non-null float64 4 AMT_APPLICATION 1670214 non-null float64 5 AMT_CREDIT 1670213 non-null float64 6 AMT_DOWN_PAYMENT 774370 non-null float64 7 AMT_GOODS_PRICE 1284699 non-null float64 8 WEEKDAY_APPR_PROCESS_START 1670214 non-null object 9 HOUR_APPR_PROCESS_START 1670214 non-null int64 10 FLAG_LAST_APPL_PER_CONTRACT 1670214 non-null object 11 NFLAG_LAST_APPL_IN_DAY 1670214 non-null int64 12 RATE_DOWN_PAYMENT 774370 non-null float64 13 RATE_INTEREST_PRIMARY 5951 non-null float64 14 RATE_INTEREST_PRIVILEGED 5951 non-null float64 15 NAME_CASH_LOAN_PURPOSE 1670214 non-null object 16 NAME_CONTRACT_STATUS 1670214 non-null object 17 DAYS_DECISION 1670214 non-null int64 18 NAME_PAYMENT_TYPE 1670214 non-null object 19 CODE_REJECT_REASON 1670214 non-null object 20 NAME_TYPE_SUITE 849809 non-null object 21 NAME_CLIENT_TYPE 1670214 non-null object 22 NAME_GOODS_CATEGORY 1670214 non-null object 23 NAME_PORTFOLIO 1670214 non-null object 24 NAME_PRODUCT_TYPE 1670214 non-null object 25 CHANNEL_TYPE 1670214 non-null object 26 SELLERPLACE_AREA 1670214 non-null int64 27 NAME_SELLER_INDUSTRY 1670214 non-null object 28 CNT_PAYMENT 1297984 non-null float64 29 NAME_YIELD_GROUP 1670214 non-null object 30 PRODUCT_COMBINATION 1669868 non-null object 31 DAYS_FIRST_DRAWING 997149 non-null float64 32 DAYS_FIRST_DUE 997149 non-null float64 33 DAYS_LAST_DUE_1ST_VERSION 997149 non-null float64 34 DAYS_LAST_DUE 997149 non-null float64 35 DAYS_TERMINATION 997149 non-null float64 36 NFLAG_INSURED_ON_APPROVAL 997149 non-null float64 dtypes: float64(15), int64(6), object(16) memory usage: 471.5+ MB
We do not see any columns with Nullable values
df_prev.shape
(1670214, 37)
There are ~1.67 million rows and 37 columns
Get statistical summary for numerical variables¶
df_prev.describe()
| SK_ID_PREV | SK_ID_CURR | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | HOUR_APPR_PROCESS_START | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | DAYS_DECISION | SELLERPLACE_AREA | CNT_PAYMENT | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1670214.00 | 1670214.00 | 1297979.00 | 1670214.00 | 1670213.00 | 774370.00 | 1284699.00 | 1670214.00 | 1670214.00 | 774370.00 | 5951.00 | 5951.00 | 1670214.00 | 1670214.00 | 1297984.00 | 997149.00 | 997149.00 | 997149.00 | 997149.00 | 997149.00 | 997149.00 |
| mean | 1923089.14 | 278357.17 | 15955.12 | 175233.86 | 196114.02 | 6697.40 | 227847.28 | 12.48 | 1.00 | 0.08 | 0.19 | 0.77 | -880.68 | 313.95 | 16.05 | 342209.86 | 13826.27 | 33767.77 | 76582.40 | 81992.34 | 0.33 |
| std | 532597.96 | 102814.82 | 14782.14 | 292779.76 | 318574.62 | 20921.50 | 315396.56 | 3.33 | 0.06 | 0.11 | 0.09 | 0.10 | 779.10 | 7127.44 | 14.57 | 88916.12 | 72444.87 | 106857.03 | 149647.42 | 153303.52 | 0.47 |
| min | 1000001.00 | 100001.00 | 0.00 | 0.00 | 0.00 | -0.90 | 0.00 | 0.00 | 0.00 | -0.00 | 0.03 | 0.37 | -2922.00 | -1.00 | 0.00 | -2922.00 | -2892.00 | -2801.00 | -2889.00 | -2874.00 | 0.00 |
| 25% | 1461857.25 | 189329.00 | 6321.78 | 18720.00 | 24160.50 | 0.00 | 50841.00 | 10.00 | 1.00 | 0.00 | 0.16 | 0.72 | -1300.00 | -1.00 | 6.00 | 365243.00 | -1628.00 | -1242.00 | -1314.00 | -1270.00 | 0.00 |
| 50% | 1923110.50 | 278714.50 | 11250.00 | 71046.00 | 80541.00 | 1638.00 | 112320.00 | 12.00 | 1.00 | 0.05 | 0.19 | 0.84 | -581.00 | 3.00 | 12.00 | 365243.00 | -831.00 | -361.00 | -537.00 | -499.00 | 0.00 |
| 75% | 2384279.75 | 367514.00 | 20658.42 | 180360.00 | 216418.50 | 7740.00 | 234000.00 | 15.00 | 1.00 | 0.11 | 0.19 | 0.85 | -280.00 | 82.00 | 24.00 | 365243.00 | -411.00 | 129.00 | -74.00 | -44.00 | 1.00 |
| max | 2845382.00 | 456255.00 | 418058.15 | 6905160.00 | 6905160.00 | 3060045.00 | 6905160.00 | 23.00 | 1.00 | 1.00 | 1.00 | 1.00 | -1.00 | 4000000.00 | 84.00 | 365243.00 | 365243.00 | 365243.00 | 365243.00 | 365243.00 | 1.00 |
Analyzing categorical variables¶
df_prev.select_dtypes(include = "object").columns
Index(['NAME_CONTRACT_TYPE', 'WEEKDAY_APPR_PROCESS_START',
'FLAG_LAST_APPL_PER_CONTRACT', 'NAME_CASH_LOAN_PURPOSE',
'NAME_CONTRACT_STATUS', 'NAME_PAYMENT_TYPE', 'CODE_REJECT_REASON',
'NAME_TYPE_SUITE', 'NAME_CLIENT_TYPE', 'NAME_GOODS_CATEGORY',
'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE', 'CHANNEL_TYPE',
'NAME_SELLER_INDUSTRY', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION'],
dtype='object')
# Checking number of categorical variables
len(df_prev.select_dtypes(include = "object").columns)
16
There are 16 categorical variables
Analyzing numerical variables¶
df_prev.select_dtypes(include=["int64","float64"]).columns
Index(['SK_ID_PREV', 'SK_ID_CURR', 'AMT_ANNUITY', 'AMT_APPLICATION',
'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE',
'HOUR_APPR_PROCESS_START', 'NFLAG_LAST_APPL_IN_DAY',
'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
'RATE_INTEREST_PRIVILEGED', 'DAYS_DECISION', 'SELLERPLACE_AREA',
'CNT_PAYMENT', 'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE',
'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE', 'DAYS_TERMINATION',
'NFLAG_INSURED_ON_APPROVAL'],
dtype='object')
# Checking number of categorical variables
len(df_prev.select_dtypes(include=["int64","float64"]).columns)
21
There are 21 numerical variables
df_prev.select_dtypes(include=["int64","float64"]).head()
| SK_ID_PREV | SK_ID_CURR | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | HOUR_APPR_PROCESS_START | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | DAYS_DECISION | SELLERPLACE_AREA | CNT_PAYMENT | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | 1730.43 | 17145.00 | 17145.00 | 0.00 | 17145.00 | 15 | 1 | 0.00 | 0.18 | 0.87 | -73 | 35 | 12.00 | 365243.00 | -42.00 | 300.00 | -42.00 | -37.00 | 0.00 |
| 1 | 2802425 | 108129 | 25188.62 | 607500.00 | 679671.00 | NaN | 607500.00 | 11 | 1 | NaN | NaN | NaN | -164 | -1 | 36.00 | 365243.00 | -134.00 | 916.00 | 365243.00 | 365243.00 | 1.00 |
| 2 | 2523466 | 122040 | 15060.74 | 112500.00 | 136444.50 | NaN | 112500.00 | 11 | 1 | NaN | NaN | NaN | -301 | -1 | 12.00 | 365243.00 | -271.00 | 59.00 | 365243.00 | 365243.00 | 1.00 |
| 3 | 2819243 | 176158 | 47041.33 | 450000.00 | 470790.00 | NaN | 450000.00 | 7 | 1 | NaN | NaN | NaN | -512 | -1 | 12.00 | 365243.00 | -482.00 | -152.00 | -182.00 | -177.00 | 1.00 |
| 4 | 1784265 | 202054 | 31924.40 | 337500.00 | 404055.00 | NaN | 337500.00 | 9 | 1 | NaN | NaN | NaN | -781 | -1 | 24.00 | NaN | NaN | NaN | NaN | NaN | NaN |
Dealing with incorrect data types¶
Check if we have any column with incorrect data type
df_prev.dtypes
SK_ID_PREV int64 SK_ID_CURR int64 NAME_CONTRACT_TYPE object AMT_ANNUITY float64 AMT_APPLICATION float64 AMT_CREDIT float64 AMT_DOWN_PAYMENT float64 AMT_GOODS_PRICE float64 WEEKDAY_APPR_PROCESS_START object HOUR_APPR_PROCESS_START int64 FLAG_LAST_APPL_PER_CONTRACT object NFLAG_LAST_APPL_IN_DAY int64 RATE_DOWN_PAYMENT float64 RATE_INTEREST_PRIMARY float64 RATE_INTEREST_PRIVILEGED float64 NAME_CASH_LOAN_PURPOSE object NAME_CONTRACT_STATUS object DAYS_DECISION int64 NAME_PAYMENT_TYPE object CODE_REJECT_REASON object NAME_TYPE_SUITE object NAME_CLIENT_TYPE object NAME_GOODS_CATEGORY object NAME_PORTFOLIO object NAME_PRODUCT_TYPE object CHANNEL_TYPE object SELLERPLACE_AREA int64 NAME_SELLER_INDUSTRY object CNT_PAYMENT float64 NAME_YIELD_GROUP object PRODUCT_COMBINATION object DAYS_FIRST_DRAWING float64 DAYS_FIRST_DUE float64 DAYS_LAST_DUE_1ST_VERSION float64 DAYS_LAST_DUE float64 DAYS_TERMINATION float64 NFLAG_INSURED_ON_APPROVAL float64 dtype: object
df_prev.head()
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_TYPE_SUITE | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.43 | 17145.00 | 17145.00 | 0.00 | 17145.00 | SATURDAY | 15 | Y | 1 | 0.00 | 0.18 | 0.87 | XAP | Approved | -73 | Cash through the bank | XAP | NaN | Repeater | Mobile | POS | XNA | Country-wide | 35 | Connectivity | 12.00 | middle | POS mobile with interest | 365243.00 | -42.00 | 300.00 | -42.00 | -37.00 | 0.00 |
| 1 | 2802425 | 108129 | Cash loans | 25188.62 | 607500.00 | 679671.00 | NaN | 607500.00 | THURSDAY | 11 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -164 | XNA | XAP | Unaccompanied | Repeater | XNA | Cash | x-sell | Contact center | -1 | XNA | 36.00 | low_action | Cash X-Sell: low | 365243.00 | -134.00 | 916.00 | 365243.00 | 365243.00 | 1.00 |
| 2 | 2523466 | 122040 | Cash loans | 15060.74 | 112500.00 | 136444.50 | NaN | 112500.00 | TUESDAY | 11 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -301 | Cash through the bank | XAP | Spouse, partner | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.00 | high | Cash X-Sell: high | 365243.00 | -271.00 | 59.00 | 365243.00 | 365243.00 | 1.00 |
| 3 | 2819243 | 176158 | Cash loans | 47041.33 | 450000.00 | 470790.00 | NaN | 450000.00 | MONDAY | 7 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -512 | Cash through the bank | XAP | NaN | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.00 | middle | Cash X-Sell: middle | 365243.00 | -482.00 | -152.00 | -182.00 | -177.00 | 1.00 |
| 4 | 1784265 | 202054 | Cash loans | 31924.40 | 337500.00 | 404055.00 | NaN | 337500.00 | THURSDAY | 9 | Y | 1 | NaN | NaN | NaN | Repairs | Refused | -781 | Cash through the bank | HC | NaN | Repeater | XNA | Cash | walk-in | Credit and cash offices | -1 | XNA | 24.00 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
Looking at the data and their corresponding data types, we can conclude that No Data Type changes are required
Dealing with missing values¶
Check if we have any null values in the dataset
df_prev.isnull().values.any()
True
Get Total number of null values in the dataset
df_prev.isnull().values.sum()
11109336
Getting the list of column(s) which have null values
df_prev.columns[df_prev.isnull().any()]
Index(['AMT_ANNUITY', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE',
'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
'RATE_INTEREST_PRIVILEGED', 'NAME_TYPE_SUITE', 'CNT_PAYMENT',
'PRODUCT_COMBINATION', 'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE',
'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE', 'DAYS_TERMINATION',
'NFLAG_INSURED_ON_APPROVAL'],
dtype='object')
len(df_prev.columns[df_prev.isnull().any()])
16
There are totally 16 columns having one or more NULL values in the data
Computing count and percentage of missing values¶
null_count = df_prev.isnull().sum()
null_percentage = round((df_prev.isnull().sum()/df_prev.shape[0])*100, 2)
null_df = pd.DataFrame({'column_name' : df_prev.columns,'null_count' : null_count,'null_percentage': null_percentage})
null_df.reset_index(drop = True, inplace = True)
null_df.sort_values(by = 'null_percentage', ascending = False)
| column_name | null_count | null_percentage | |
|---|---|---|---|
| 14 | RATE_INTEREST_PRIVILEGED | 1664263 | 99.64 |
| 13 | RATE_INTEREST_PRIMARY | 1664263 | 99.64 |
| 12 | RATE_DOWN_PAYMENT | 895844 | 53.64 |
| 6 | AMT_DOWN_PAYMENT | 895844 | 53.64 |
| 20 | NAME_TYPE_SUITE | 820405 | 49.12 |
| 36 | NFLAG_INSURED_ON_APPROVAL | 673065 | 40.30 |
| 31 | DAYS_FIRST_DRAWING | 673065 | 40.30 |
| 32 | DAYS_FIRST_DUE | 673065 | 40.30 |
| 33 | DAYS_LAST_DUE_1ST_VERSION | 673065 | 40.30 |
| 34 | DAYS_LAST_DUE | 673065 | 40.30 |
| 35 | DAYS_TERMINATION | 673065 | 40.30 |
| 7 | AMT_GOODS_PRICE | 385515 | 23.08 |
| 3 | AMT_ANNUITY | 372235 | 22.29 |
| 28 | CNT_PAYMENT | 372230 | 22.29 |
| 30 | PRODUCT_COMBINATION | 346 | 0.02 |
| 25 | CHANNEL_TYPE | 0 | 0.00 |
| 24 | NAME_PRODUCT_TYPE | 0 | 0.00 |
| 29 | NAME_YIELD_GROUP | 0 | 0.00 |
| 26 | SELLERPLACE_AREA | 0 | 0.00 |
| 27 | NAME_SELLER_INDUSTRY | 0 | 0.00 |
| 22 | NAME_GOODS_CATEGORY | 0 | 0.00 |
| 23 | NAME_PORTFOLIO | 0 | 0.00 |
| 0 | SK_ID_PREV | 0 | 0.00 |
| 21 | NAME_CLIENT_TYPE | 0 | 0.00 |
| 19 | CODE_REJECT_REASON | 0 | 0.00 |
| 1 | SK_ID_CURR | 0 | 0.00 |
| 17 | DAYS_DECISION | 0 | 0.00 |
| 16 | NAME_CONTRACT_STATUS | 0 | 0.00 |
| 15 | NAME_CASH_LOAN_PURPOSE | 0 | 0.00 |
| 11 | NFLAG_LAST_APPL_IN_DAY | 0 | 0.00 |
| 10 | FLAG_LAST_APPL_PER_CONTRACT | 0 | 0.00 |
| 9 | HOUR_APPR_PROCESS_START | 0 | 0.00 |
| 8 | WEEKDAY_APPR_PROCESS_START | 0 | 0.00 |
| 5 | AMT_CREDIT | 1 | 0.00 |
| 4 | AMT_APPLICATION | 0 | 0.00 |
| 2 | NAME_CONTRACT_TYPE | 0 | 0.00 |
| 18 | NAME_PAYMENT_TYPE | 0 | 0.00 |
Removing columns with NULL values > 40%¶
Getting list of columns with NULL values > 40% into a list. We will be removing these columns from the dataframe as there are too many missing values.
columns_to_be_deleted = null_df[null_df['null_percentage'] > 40].column_name.to_list()
columns_to_be_deleted
['AMT_DOWN_PAYMENT', 'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY', 'RATE_INTEREST_PRIVILEGED', 'NAME_TYPE_SUITE', 'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE', 'DAYS_TERMINATION', 'NFLAG_INSURED_ON_APPROVAL']
len(columns_to_be_deleted)
11
There are totally 11 columns to be removed. Deleting them from previous application dataframe df_prev
df_prev.drop(columns = columns_to_be_deleted, inplace = True)
Checking column count post removal. Only 26 columns should be left
df_prev.shape
(1670214, 26)
Checking columns with NULL values < 40%¶
Creating dataframe null_df_under40 with missing column percentages under 40%
null_df_under40 = null_df[null_df['null_percentage'] < 40]
null_df_under40.sort_values(by = 'null_percentage', ascending = False)
| column_name | null_count | null_percentage | |
|---|---|---|---|
| 7 | AMT_GOODS_PRICE | 385515 | 23.08 |
| 3 | AMT_ANNUITY | 372235 | 22.29 |
| 28 | CNT_PAYMENT | 372230 | 22.29 |
| 30 | PRODUCT_COMBINATION | 346 | 0.02 |
| 18 | NAME_PAYMENT_TYPE | 0 | 0.00 |
| 29 | NAME_YIELD_GROUP | 0 | 0.00 |
| 27 | NAME_SELLER_INDUSTRY | 0 | 0.00 |
| 26 | SELLERPLACE_AREA | 0 | 0.00 |
| 25 | CHANNEL_TYPE | 0 | 0.00 |
| 24 | NAME_PRODUCT_TYPE | 0 | 0.00 |
| 23 | NAME_PORTFOLIO | 0 | 0.00 |
| 22 | NAME_GOODS_CATEGORY | 0 | 0.00 |
| 21 | NAME_CLIENT_TYPE | 0 | 0.00 |
| 19 | CODE_REJECT_REASON | 0 | 0.00 |
| 0 | SK_ID_PREV | 0 | 0.00 |
| 1 | SK_ID_CURR | 0 | 0.00 |
| 16 | NAME_CONTRACT_STATUS | 0 | 0.00 |
| 15 | NAME_CASH_LOAN_PURPOSE | 0 | 0.00 |
| 11 | NFLAG_LAST_APPL_IN_DAY | 0 | 0.00 |
| 10 | FLAG_LAST_APPL_PER_CONTRACT | 0 | 0.00 |
| 9 | HOUR_APPR_PROCESS_START | 0 | 0.00 |
| 8 | WEEKDAY_APPR_PROCESS_START | 0 | 0.00 |
| 5 | AMT_CREDIT | 1 | 0.00 |
| 4 | AMT_APPLICATION | 0 | 0.00 |
| 2 | NAME_CONTRACT_TYPE | 0 | 0.00 |
| 17 | DAYS_DECISION | 0 | 0.00 |
Analysis of AMT_GOODS_PRICE column¶
- nullable values = 23.08%
df_prev.AMT_GOODS_PRICE.value_counts().head()
45000.00 47831 225000.00 43549 135000.00 40666 450000.00 38926 90000.00 29367 Name: AMT_GOODS_PRICE, dtype: int64
sns.boxplot(df_prev.AMT_GOODS_PRICE)
plt.show()
Getting percentile values for AMT_GOODS_PRICE
df_prev.AMT_GOODS_PRICE.quantile(q = [0.25,0.5,0.75,1])
0.25 50841.00 0.50 112320.00 0.75 234000.00 1.00 6905160.00 Name: AMT_GOODS_PRICE, dtype: float64
Most recurring value in AMT_GOODS_PRICE
df_prev.AMT_GOODS_PRICE.mode()[0]
45000.0
Checking the average value of AMT_GOODS_PRICE
df_prev.AMT_GOODS_PRICE.mean()
227847.27928334344
Observations
- Looking at the boxplot, median is 112320.00
- Most recurring value is 45000.0
- Mean value is 227847.27928334344
- Since missing percentage value is higher (23.08%), it would be better to leave the data as it is and not perform imputations
Analysis of AMT_ANNUITY column¶
- nullable values = 22.29%
df_prev.AMT_ANNUITY.value_counts().head()
2250.00 31865 11250.00 13974 6750.00 13442 9000.00 12496 22500.00 11903 Name: AMT_ANNUITY, dtype: int64
sns.boxplot(df_prev.AMT_ANNUITY)
plt.show()
Getting percentile values for AMT_ANNUITY
df_prev.AMT_ANNUITY.quantile(q = [0.25,0.5,0.75,1])
0.25 6321.78 0.50 11250.00 0.75 20658.42 1.00 418058.15 Name: AMT_ANNUITY, dtype: float64
Most recurring value in AMT_ANNUITY
df_prev.AMT_ANNUITY.mode()[0]
2250.0
Checking the average value of AMT_ANNUITY
df_prev.AMT_ANNUITY.mean()
15955.120659450406
Observations
- Looking at the boxplot, median is 11250.00
- Most recurring value is 2250.0
- Mean value is 15955.120659450406
- Since missing percentage value is higher (22.29%), it would be better to leave the data as it is and not perform imputations
Analysis of CNT_PAYMENT column¶
- nullable values = 22.29%
df_prev.CNT_PAYMENT.value_counts().head()
12.00 323049 6.00 190461 0.00 144985 10.00 141851 24.00 137764 Name: CNT_PAYMENT, dtype: int64
sns.boxplot(df_prev.CNT_PAYMENT)
plt.show()
Getting percentile values for CNT_PAYMENT
df_prev.CNT_PAYMENT.quantile(q = [0.25,0.5,0.75,1])
0.25 6.00 0.50 12.00 0.75 24.00 1.00 84.00 Name: CNT_PAYMENT, dtype: float64
Most recurring value in CNT_PAYMENT
df_prev.CNT_PAYMENT.mode()[0]
12.0
Checking the average value of CNT_PAYMENT
df_prev.CNT_PAYMENT.mean()
16.0540815603274
Observations
- Looking at the boxplot, median is 12.00
- Most recurring value is 12.0
- Mean value is 16.0540815603274
- Though median & mode are same, since missing percentage value is higher (22.29%), it would be better to leave the data as it is and not perform imputations
Checking columns with NULL values > 0% and < 1%¶
Creating dataframe null_df_under1 with missing column percentages values > 0% and < 1%
null_df_under1 = null_df[(null_df['null_percentage'] > 0) & (null_df['null_percentage'] < 1)]
null_df_under1.sort_values(by = 'null_percentage', ascending = False)
| column_name | null_count | null_percentage | |
|---|---|---|---|
| 30 | PRODUCT_COMBINATION | 346 | 0.02 |
Analysis of PRODUCT_COMBINATION column¶
- nullable values = 0.02%
df_prev['PRODUCT_COMBINATION'].value_counts().head()
Cash 285990 POS household with interest 263622 POS mobile with interest 220670 Cash X-Sell: middle 143883 Cash X-Sell: low 130248 Name: PRODUCT_COMBINATION, dtype: int64
plt.figure(figsize = (10,5))
sns.countplot(data = df_prev, x = "PRODUCT_COMBINATION")
plt.xticks(rotation = 90)
plt.show()
Observations
- Looking at the plot,
CashProduct_Combination has the highest number of loan applicants - We can go ahead and impute
Cashin the dataframe
Dealing with incorrect/unknown data values¶
Analysis of NAME_CASH_LOAN_PURPOSE column¶
Checking range of values
df_prev['NAME_CASH_LOAN_PURPOSE'].value_counts(normalize = True).head()
XAP 0.55 XNA 0.41 Repairs 0.01 Other 0.01 Urgent needs 0.01 Name: NAME_CASH_LOAN_PURPOSE, dtype: float64
Observations
- Though
XAPandXNAdon't provide any understanding of Loan purpose, they indicate Not Available and Not Applicable - Since this makes up 99% of the data, we will leave it as is
Analysis of DAYS_DECISION column¶
df_prev['DAYS_DECISION'].value_counts()
-245 2444
-238 2390
-210 2375
-273 2350
-196 2315
...
-2882 163
-2922 162
-2921 158
-2902 148
-1 2
Name: DAYS_DECISION, Length: 2922, dtype: int64
There are 2922 unique records all of which seem to be having negative values
df_prev['DAYS_DECISION'].unique()
array([ -73, -164, -301, ..., -1967, -2389, -1])
df_prev['DAYS_DECISION'].nunique()
2922
Converting DAYS_DECISION to positive days
df_prev['DAYS_DECISION'] = df_prev['DAYS_DECISION'].apply(lambda x: -x if x < 0 else x)
df_prev['DAYS_DECISION'].value_counts()
245 2444
238 2390
210 2375
273 2350
196 2315
...
2882 163
2922 162
2921 158
2902 148
1 2
Name: DAYS_DECISION, Length: 2922, dtype: int64
All Days in DAYS_DECISION have positive values
Analysis of NAME_PAYMENT_TYPE column¶
Checking range of values
df_prev['NAME_PAYMENT_TYPE'].value_counts(normalize = True)
Cash through the bank 0.62 XNA 0.38 Non-cash from your account 0.00 Cashless from the account of the employer 0.00 Name: NAME_PAYMENT_TYPE, dtype: float64
Observations
- Though
XNAdoesn't provide any understanding of Payment TYpe, it indicates Not Applicable - Since this makes up 38% of the data, we will leave it as is
Analysis of CODE_REJECT_REASON column¶
Checking range of values
df_prev['CODE_REJECT_REASON'].value_counts(normalize = True)
XAP 0.81 HC 0.10 LIMIT 0.03 SCO 0.02 CLIENT 0.02 SCOFR 0.01 XNA 0.00 VERIF 0.00 SYSTEM 0.00 Name: CODE_REJECT_REASON, dtype: float64
Observations
- Though
XAPdoesn't provide any understanding of Payment Type, it indicates Not Available - Since this makes up 81% of the data, we will leave it as is
Analysis of NAME_CLIENT_TYPE column¶
Checking range of values
df_prev['NAME_CLIENT_TYPE'].value_counts(normalize=True)
Repeater 0.74 New 0.18 Refreshed 0.08 XNA 0.00 Name: NAME_CLIENT_TYPE, dtype: float64
XNA value may indicate that the value was not provided by the loan applicant or missed by the loan officer verifying the application
df_prev[df_prev['NAME_CLIENT_TYPE'] == 'XNA'].head()
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 694 | 1044207 | 241230 | Cash loans | NaN | 0.00 | 0.00 | NaN | SUNDAY | 12 | Y | 1 | XNA | Canceled | 53 | XNA | XAP | XNA | XNA | XNA | XNA | Credit and cash offices | -1 | XNA | NaN | XNA | Cash |
| 3419 | 1167657 | 366138 | Cash loans | 55400.89 | 1354500.00 | 1451047.50 | 1354500.00 | MONDAY | 17 | Y | 1 | XNA | Refused | 231 | Cash through the bank | HC | XNA | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 36.00 | low_normal | Cash X-Sell: low |
| 3594 | 2609511 | 334664 | Consumer loans | 2223.99 | 20659.50 | 23112.00 | 20659.50 | MONDAY | 11 | Y | 1 | XAP | Approved | 245 | Cash through the bank | XAP | XNA | Computers | POS | XNA | Country-wide | 50 | Consumer electronics | 12.00 | low_action | POS household without interest |
| 5819 | 2092311 | 281867 | Cash loans | NaN | 0.00 | 0.00 | NaN | MONDAY | 13 | Y | 1 | XNA | Canceled | 242 | XNA | XAP | XNA | XNA | XNA | XNA | Credit and cash offices | -1 | XNA | NaN | XNA | Cash |
| 5990 | 2680830 | 335491 | Cash loans | 32048.46 | 877500.00 | 1016496.00 | 877500.00 | MONDAY | 9 | Y | 1 | Repairs | Refused | 195 | Cash through the bank | HC | XNA | XNA | Cash | walk-in | Credit and cash offices | -1 | XNA | 48.00 | low_normal | Cash Street: low |
As data looks valid, we will go ahead and check for an imputation method.
Repeaterapplicants make up 74% of applicants- And so, we will go ahead and impute
NAME_CLIENT_TYPEwith 'Repeater'
df_prev['NAME_CLIENT_TYPE'] = df_prev['NAME_CLIENT_TYPE'].apply(lambda x: 'Repeater' if x == 'XNA' else x)
Checking if XNA is removed
df_prev['NAME_CLIENT_TYPE'].value_counts(normalize=True)
Repeater 0.74 New 0.18 Refreshed 0.08 Name: NAME_CLIENT_TYPE, dtype: float64
There are other columns with XNA but we will leave them as is
Dealing with outliers for numerical columns¶
Analysis of AMT_ANNUITY column¶
df_prev['AMT_ANNUITY'].value_counts().sort_values(ascending = False).head()
2250.00 31865 11250.00 13974 6750.00 13442 9000.00 12496 22500.00 11903 Name: AMT_ANNUITY, dtype: int64
(df_prev['AMT_ANNUITY'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()
2250.00 2.45 11250.00 1.08 6750.00 1.04 9000.00 0.96 22500.00 0.92 Name: AMT_ANNUITY, dtype: float64
df_prev.AMT_ANNUITY.quantile(q = [0.25,0.5,0.75,0.99,1])
0.25 6321.78 0.50 11250.00 0.75 20658.42 0.99 69685.79 1.00 418058.15 Name: AMT_ANNUITY, dtype: float64
fn_dist_box(df_prev,'AMT_ANNUITY')
Calculating IQR (Inter Quartile range)
Q1 = df_prev['AMT_ANNUITY'].quantile(0.25)
Q3 = df_prev['AMT_ANNUITY'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
14336.64
Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))
Min value before which outlier exist: -15183.18 Max value after which outlier exist: 42163.38
Observations
AMT_ANNUITYvalues above 42163.38 are outliers
Analysis of AMT_APPLICATION column¶
df_prev['AMT_APPLICATION'].value_counts().sort_values(ascending = False).head()
0.00 392402 45000.00 47831 225000.00 43543 135000.00 40678 450000.00 38905 Name: AMT_APPLICATION, dtype: int64
(df_prev['AMT_APPLICATION'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()
0.00 23.49 45000.00 2.86 225000.00 2.61 135000.00 2.44 450000.00 2.33 Name: AMT_APPLICATION, dtype: float64
df_prev.AMT_APPLICATION.quantile(q = [0.25,0.5,0.75,0.99,1])
0.25 18720.00 0.50 71046.00 0.75 180360.00 0.99 1350000.00 1.00 6905160.00 Name: AMT_APPLICATION, dtype: float64
fn_dist_box(df_prev,'AMT_APPLICATION')
Calculating IQR (Inter Quartile range)
Q1 = df_prev['AMT_APPLICATION'].quantile(0.25)
Q3 = df_prev['AMT_APPLICATION'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
161640.0
Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))
Min value before which outlier exist: -223740.0 Max value after which outlier exist: 422820.0
Observations
AMT_APPLICATIONvalues above 422820.0 are outliers
Analysis of AMT_CREDIT column¶
df_prev['AMT_CREDIT'].value_counts().sort_values(ascending = False).head()
0.00 336768 45000.00 35051 225000.00 21094 450000.00 19954 135000.00 18720 Name: AMT_CREDIT, dtype: int64
(df_prev['AMT_CREDIT'].value_counts(normalize = True).sort_values(ascending = False) * 100).head()
0.00 20.16 45000.00 2.10 225000.00 1.26 450000.00 1.19 135000.00 1.12 Name: AMT_CREDIT, dtype: float64
df_prev.AMT_CREDIT.quantile(q = [0.25,0.5,0.75,0.99,1])
0.25 24160.50 0.50 80541.00 0.75 216418.50 0.99 1515377.70 1.00 6905160.00 Name: AMT_CREDIT, dtype: float64
fn_dist_box(df_prev,'AMT_CREDIT')
Calculating IQR (Inter Quartile range)
Q1 = df_prev['AMT_CREDIT'].quantile(0.25)
Q3 = df_prev['AMT_CREDIT'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
192258.0
Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))
Min value before which outlier exist: -264226.5 Max value after which outlier exist: 504805.5
Observations
AMT_CREDITvalues above 504805.5 are outliers
Merge datasets df and df_prev into df_merge¶
df_merge = df.merge(df_prev, left_on='SK_ID_CURR', right_on='SK_ID_CURR', how='inner')
df.shape
(307511, 81)
df_prev.shape
(1670214, 26)
df_merge.shape
(1413701, 106)
Check structure of the data¶
df_merge.info(verbose = True,null_counts = True)
<class 'pandas.core.frame.DataFrame'> Int64Index: 1413701 entries, 0 to 1413700 Data columns (total 106 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SK_ID_CURR 1413701 non-null int64 1 TARGET 1413701 non-null int64 2 NAME_CONTRACT_TYPE_x 1413701 non-null object 3 CODE_GENDER 1413701 non-null object 4 FLAG_OWN_CAR 1413701 non-null object 5 FLAG_OWN_REALTY 1413701 non-null object 6 CNT_CHILDREN 1413701 non-null int64 7 AMT_INCOME_TOTAL 1413701 non-null float64 8 AMT_CREDIT_x 1413701 non-null float64 9 AMT_ANNUITY_x 1413608 non-null float64 10 AMT_GOODS_PRICE_x 1412493 non-null float64 11 NAME_TYPE_SUITE 1410175 non-null object 12 NAME_INCOME_TYPE 1413701 non-null object 13 NAME_EDUCATION_TYPE 1413701 non-null object 14 NAME_FAMILY_STATUS 1413701 non-null object 15 NAME_HOUSING_TYPE 1413701 non-null object 16 REGION_POPULATION_RELATIVE 1413701 non-null float64 17 DAYS_BIRTH 1413701 non-null int64 18 DAYS_EMPLOYED 1413701 non-null int64 19 DAYS_REGISTRATION 1413701 non-null float64 20 DAYS_ID_PUBLISH 1413701 non-null int64 21 FLAG_MOBIL 1413701 non-null int64 22 FLAG_EMP_PHONE 1413701 non-null int64 23 FLAG_WORK_PHONE 1413701 non-null int64 24 FLAG_CONT_MOBILE 1413701 non-null int64 25 FLAG_PHONE 1413701 non-null int64 26 FLAG_EMAIL 1413701 non-null int64 27 OCCUPATION_TYPE 1413701 non-null object 28 CNT_FAM_MEMBERS 1413701 non-null float64 29 REGION_RATING_CLIENT 1413701 non-null int64 30 REGION_RATING_CLIENT_W_CITY 1413701 non-null int64 31 WEEKDAY_APPR_PROCESS_START_x 1413701 non-null object 32 HOUR_APPR_PROCESS_START_x 1413701 non-null int64 33 REG_REGION_NOT_LIVE_REGION 1413701 non-null int64 34 REG_REGION_NOT_WORK_REGION 1413701 non-null int64 35 LIVE_REGION_NOT_WORK_REGION 1413701 non-null int64 36 REG_CITY_NOT_LIVE_CITY 1413701 non-null int64 37 REG_CITY_NOT_WORK_CITY 1413701 non-null int64 38 LIVE_CITY_NOT_WORK_CITY 1413701 non-null int64 39 ORGANIZATION_TYPE 1413701 non-null object 40 EXT_SOURCE_2 1411880 non-null float64 41 EXT_SOURCE_3 1171859 non-null float64 42 OBS_30_CNT_SOCIAL_CIRCLE 1410555 non-null float64 43 DEF_30_CNT_SOCIAL_CIRCLE 1410555 non-null float64 44 OBS_60_CNT_SOCIAL_CIRCLE 1410555 non-null float64 45 DEF_60_CNT_SOCIAL_CIRCLE 1410555 non-null float64 46 DAYS_LAST_PHONE_CHANGE 1413701 non-null float64 47 FLAG_DOCUMENT_2 1413701 non-null int64 48 FLAG_DOCUMENT_3 1413701 non-null int64 49 FLAG_DOCUMENT_4 1413701 non-null int64 50 FLAG_DOCUMENT_5 1413701 non-null int64 51 FLAG_DOCUMENT_6 1413701 non-null int64 52 FLAG_DOCUMENT_7 1413701 non-null int64 53 FLAG_DOCUMENT_8 1413701 non-null int64 54 FLAG_DOCUMENT_9 1413701 non-null int64 55 FLAG_DOCUMENT_10 1413701 non-null int64 56 FLAG_DOCUMENT_11 1413701 non-null int64 57 FLAG_DOCUMENT_12 1413701 non-null int64 58 FLAG_DOCUMENT_13 1413701 non-null int64 59 FLAG_DOCUMENT_14 1413701 non-null int64 60 FLAG_DOCUMENT_15 1413701 non-null int64 61 FLAG_DOCUMENT_16 1413701 non-null int64 62 FLAG_DOCUMENT_17 1413701 non-null int64 63 FLAG_DOCUMENT_18 1413701 non-null int64 64 FLAG_DOCUMENT_19 1413701 non-null int64 65 FLAG_DOCUMENT_20 1413701 non-null int64 66 FLAG_DOCUMENT_21 1413701 non-null int64 67 AMT_REQ_CREDIT_BUREAU_HOUR 1250074 non-null float64 68 AMT_REQ_CREDIT_BUREAU_DAY 1250074 non-null float64 69 AMT_REQ_CREDIT_BUREAU_WEEK 1250074 non-null float64 70 AMT_REQ_CREDIT_BUREAU_MON 1250074 non-null float64 71 AMT_REQ_CREDIT_BUREAU_QRT 1250074 non-null float64 72 AMT_REQ_CREDIT_BUREAU_YEAR 1250074 non-null float64 73 YEARS_BIRTH 1413701 non-null int64 74 YEARS_EMPLOYED 1413701 non-null int64 75 YEARS_REGISTRATION 1413701 non-null int64 76 YEARS_ID_PUBLISH 1413701 non-null int64 77 YEARS_LAST_PHONE_CHANGE 1413701 non-null float64 78 AMT_GOODS_PRICE_CATEGORY 1412493 non-null category 79 YEARS_BIRTH_CATEGORY 1413701 non-null object 80 YEARS_REGISTRATION_CATEGORY 1413701 non-null object 81 SK_ID_PREV 1413701 non-null int64 82 NAME_CONTRACT_TYPE_y 1413701 non-null object 83 AMT_ANNUITY_y 1106483 non-null float64 84 AMT_APPLICATION 1413701 non-null float64 85 AMT_CREDIT_y 1413700 non-null float64 86 AMT_GOODS_PRICE_y 1094176 non-null float64 87 WEEKDAY_APPR_PROCESS_START_y 1413701 non-null object 88 HOUR_APPR_PROCESS_START_y 1413701 non-null int64 89 FLAG_LAST_APPL_PER_CONTRACT 1413701 non-null object 90 NFLAG_LAST_APPL_IN_DAY 1413701 non-null int64 91 NAME_CASH_LOAN_PURPOSE 1413701 non-null object 92 NAME_CONTRACT_STATUS 1413701 non-null object 93 DAYS_DECISION 1413701 non-null int64 94 NAME_PAYMENT_TYPE 1413701 non-null object 95 CODE_REJECT_REASON 1413701 non-null object 96 NAME_CLIENT_TYPE 1413701 non-null object 97 NAME_GOODS_CATEGORY 1413701 non-null object 98 NAME_PORTFOLIO 1413701 non-null object 99 NAME_PRODUCT_TYPE 1413701 non-null object 100 CHANNEL_TYPE 1413701 non-null object 101 SELLERPLACE_AREA 1413701 non-null int64 102 NAME_SELLER_INDUSTRY 1413701 non-null object 103 CNT_PAYMENT 1106488 non-null float64 104 NAME_YIELD_GROUP 1413701 non-null object 105 PRODUCT_COMBINATION 1413388 non-null object dtypes: category(1), float64(26), int64(50), object(29) memory usage: 1.1+ GB
We do not see any columns with Nullable values
df_merge.shape
(1413701, 106)
There are ~1.41 million rows and 106 columns
df_merge[df_merge['SK_ID_CURR'] == 265681][['AMT_CREDIT_x','AMT_CREDIT_y']].head()
| AMT_CREDIT_x | AMT_CREDIT_y | |
|---|---|---|
| 654385 | 278460.00 | 109552.50 |
| 654386 | 278460.00 | 22234.50 |
| 654387 | 278460.00 | 225000.00 |
| 654388 | 278460.00 | 118989.00 |
| 654389 | 278460.00 | 337824.00 |
df[df['SK_ID_CURR'] == 265681][['AMT_CREDIT']].head()
| AMT_CREDIT | |
|---|---|
| 142895 | 278460.00 |
Get statistical summary for numerical variables¶
df_merge.describe()
| SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT_x | AMT_ANNUITY_x | AMT_GOODS_PRICE_x | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START_x | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | EXT_SOURCE_2 | EXT_SOURCE_3 | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | YEARS_BIRTH | YEARS_EMPLOYED | YEARS_REGISTRATION | YEARS_ID_PUBLISH | YEARS_LAST_PHONE_CHANGE | SK_ID_PREV | AMT_ANNUITY_y | AMT_APPLICATION | AMT_CREDIT_y | AMT_GOODS_PRICE_y | HOUR_APPR_PROCESS_START_y | NFLAG_LAST_APPL_IN_DAY | DAYS_DECISION | SELLERPLACE_AREA | CNT_PAYMENT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413608.00 | 1412493.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1411880.00 | 1171859.00 | 1410555.00 | 1410555.00 | 1410555.00 | 1410555.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1250074.00 | 1250074.00 | 1250074.00 | 1250074.00 | 1250074.00 | 1250074.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1106483.00 | 1413701.00 | 1413700.00 | 1094176.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1413701.00 | 1106488.00 |
| mean | 278481.29 | 0.09 | 0.40 | 173316.04 | 587553.67 | 27017.02 | 527718.59 | 0.02 | 16321.05 | 72663.47 | 5003.23 | 3034.56 | 1.00 | 0.81 | 0.18 | 1.00 | 0.28 | 0.07 | 2.15 | 2.06 | 2.04 | 11.98 | 0.01 | 0.04 | 0.04 | 0.07 | 0.22 | 0.17 | 0.51 | 0.49 | 1.54 | 0.15 | 1.53 | 0.11 | 1084.70 | 0.00 | 0.74 | 0.00 | 0.01 | 0.09 | 0.00 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.01 | 0.04 | 0.30 | 0.36 | 2.91 | 44.72 | 199.14 | 13.70 | 8.31 | 2.97 | 1922744.33 | 15837.18 | 175243.59 | 196354.09 | 226451.19 | 12.48 | 1.00 | 880.37 | 314.99 | 16.05 |
| std | 102811.79 | 0.28 | 0.72 | 198573.37 | 384917.33 | 13951.16 | 353246.49 | 0.01 | 4344.56 | 143337.36 | 3551.05 | 1507.38 | 0.00 | 0.40 | 0.39 | 0.03 | 0.45 | 0.26 | 0.90 | 0.50 | 0.49 | 3.23 | 0.11 | 0.21 | 0.19 | 0.26 | 0.42 | 0.38 | 0.19 | 0.20 | 2.53 | 0.47 | 2.51 | 0.38 | 799.94 | 0.01 | 0.44 | 0.01 | 0.12 | 0.29 | 0.01 | 0.26 | 0.06 | 0.01 | 0.04 | 0.00 | 0.04 | 0.04 | 0.03 | 0.08 | 0.01 | 0.06 | 0.02 | 0.02 | 0.01 | 0.08 | 0.11 | 0.21 | 0.98 | 0.93 | 2.20 | 11.91 | 392.84 | 9.74 | 4.15 | 2.21 | 532715.34 | 14724.91 | 293622.19 | 319481.25 | 315937.60 | 3.33 | 0.06 | 783.54 | 7695.08 | 14.57 |
| min | 100002.00 | 0.00 | 0.00 | 25650.00 | 45000.00 | 1615.50 | 40500.00 | 0.00 | 7489.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 21.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1000001.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | -1.00 | 0.00 |
| 25% | 189364.00 | 0.00 | 0.00 | 112500.00 | 270000.00 | 16821.00 | 238500.00 | 0.01 | 12739.00 | 1042.00 | 2001.00 | 1783.00 | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 2.00 | 2.00 | 2.00 | 10.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.39 | 0.34 | 0.00 | 0.00 | 0.00 | 0.00 | 396.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 35.00 | 3.00 | 5.00 | 5.00 | 1.00 | 1461346.00 | 6257.86 | 19750.50 | 24880.50 | 49828.50 | 10.00 | 1.00 | 271.00 | -1.00 | 6.00 |
| 50% | 278992.00 | 0.00 | 0.00 | 157500.00 | 508495.50 | 24925.50 | 450000.00 | 0.02 | 16044.00 | 2401.00 | 4508.00 | 3330.00 | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 2.00 | 2.00 | 2.00 | 12.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.56 | 0.51 | 0.00 | 0.00 | 0.00 | 0.00 | 1011.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | 44.00 | 7.00 | 12.00 | 9.00 | 3.00 | 1922698.00 | 11223.76 | 70870.50 | 80595.00 | 110245.50 | 12.00 | 1.00 | 582.00 | 4.00 | 12.00 |
| 75% | 367556.00 | 0.00 | 1.00 | 207000.00 | 807984.00 | 34542.00 | 679500.00 | 0.03 | 19980.00 | 6313.00 | 7510.00 | 4319.00 | 1.00 | 1.00 | 0.00 | 1.00 | 1.00 | 0.00 | 3.00 | 2.00 | 2.00 | 14.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.66 | 0.65 | 2.00 | 0.00 | 2.00 | 0.00 | 1683.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 4.00 | 55.00 | 17.00 | 21.00 | 12.00 | 5.00 | 2384012.00 | 20420.78 | 180000.00 | 215640.00 | 229500.00 | 15.00 | 1.00 | 1313.00 | 85.00 | 24.00 |
| max | 456255.00 | 1.00 | 19.00 | 117000000.00 | 4050000.00 | 225000.00 | 4050000.00 | 0.07 | 25201.00 | 365243.00 | 24672.00 | 7197.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 20.00 | 3.00 | 3.00 | 23.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.85 | 0.90 | 348.00 | 34.00 | 344.00 | 24.00 | 4292.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 4.00 | 9.00 | 8.00 | 27.00 | 261.00 | 25.00 | 69.00 | 1001.00 | 68.00 | 20.00 | 12.00 | 2845381.00 | 418058.15 | 5850000.00 | 4509688.50 | 5850000.00 | 23.00 | 1.00 | 2922.00 | 4000000.00 | 84.00 |
Analyzing categorical variables¶
df_merge.select_dtypes(include = "object").columns
Index(['NAME_CONTRACT_TYPE_x', 'CODE_GENDER', 'FLAG_OWN_CAR',
'FLAG_OWN_REALTY', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE',
'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE',
'OCCUPATION_TYPE', 'WEEKDAY_APPR_PROCESS_START_x', 'ORGANIZATION_TYPE',
'YEARS_BIRTH_CATEGORY', 'YEARS_REGISTRATION_CATEGORY',
'NAME_CONTRACT_TYPE_y', 'WEEKDAY_APPR_PROCESS_START_y',
'FLAG_LAST_APPL_PER_CONTRACT', 'NAME_CASH_LOAN_PURPOSE',
'NAME_CONTRACT_STATUS', 'NAME_PAYMENT_TYPE', 'CODE_REJECT_REASON',
'NAME_CLIENT_TYPE', 'NAME_GOODS_CATEGORY', 'NAME_PORTFOLIO',
'NAME_PRODUCT_TYPE', 'CHANNEL_TYPE', 'NAME_SELLER_INDUSTRY',
'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION'],
dtype='object')
# Checking number of categorical variables
len(df_merge.select_dtypes(include = "object").columns)
29
There are 29 categorical variables
Analyzing numerical variables¶
df_merge.select_dtypes(include=["int64","float64"]).columns
Index(['SK_ID_CURR', 'TARGET', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
'AMT_CREDIT_x', 'AMT_ANNUITY_x', 'AMT_GOODS_PRICE_x',
'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'FLAG_MOBIL', 'FLAG_EMP_PHONE',
'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL',
'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT',
'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START_x',
'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_2',
'EXT_SOURCE_3', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE',
'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3',
'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6',
'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9',
'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12',
'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15',
'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18',
'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21',
'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR',
'YEARS_BIRTH', 'YEARS_EMPLOYED', 'YEARS_REGISTRATION',
'YEARS_ID_PUBLISH', 'YEARS_LAST_PHONE_CHANGE', 'SK_ID_PREV',
'AMT_ANNUITY_y', 'AMT_APPLICATION', 'AMT_CREDIT_y', 'AMT_GOODS_PRICE_y',
'HOUR_APPR_PROCESS_START_y', 'NFLAG_LAST_APPL_IN_DAY', 'DAYS_DECISION',
'SELLERPLACE_AREA', 'CNT_PAYMENT'],
dtype='object')
# Checking number of categorical variables
len(df_merge.select_dtypes(include=["int64","float64"]).columns)
76
There are 76 numerical variables
Univariate analysis of categorical variables¶
Analysis of NAME_CONTRACT_STATUS¶
df_merge['NAME_CONTRACT_STATUS'].value_counts().sort_values(ascending = False)
Approved 886099 Canceled 259441 Refused 245390 Unused offer 22771 Name: NAME_CONTRACT_STATUS, dtype: int64
fn_uni_countplot_merge('NAME_CONTRACT_STATUS')
fn_uni_piechart_merge('NAME_CONTRACT_STATUS')
Observations
Approvedloan status is the highest among all loan applicationsCanceledloan status is the second highest among all loan applications
Analysis of NAME_CLIENT_TYPE¶
df_merge['NAME_CLIENT_TYPE'].value_counts().sort_values(ascending = False)
Repeater 1039225 New 259540 Refreshed 114936 Name: NAME_CLIENT_TYPE, dtype: int64
fn_uni_countplot_merge('NAME_CLIENT_TYPE')
fn_uni_piechart_merge('NAME_CLIENT_TYPE')
Observations
Repeaterclient type is the highest among all loan applicationsNewclient type is the second highest among all loan applications
Analysis of CHANNEL_TYPE¶
df_merge['CHANNEL_TYPE'].value_counts().sort_values(ascending = False)
Credit and cash offices 600342 Country-wide 423062 Stone 183508 Regional / Local 92472 Contact center 58891 AP+ (Cash loan) 49624 Channel of corporate sales 5396 Car dealer 406 Name: CHANNEL_TYPE, dtype: int64
fn_uni_countplot_merge('CHANNEL_TYPE')
fn_uni_piechart_merge('CHANNEL_TYPE')
Observations
Country-wideChannel type is the highest among all loan applicationsCredit and cash officesis the second highest Channel Type among all loan applications
Analysis of NAME_YIELD_GROUP¶
df_merge['NAME_YIELD_GROUP'].value_counts().sort_values(ascending = False)
XNA 429198 middle 324971 high 306061 low_normal 274884 low_action 78587 Name: NAME_YIELD_GROUP, dtype: int64
fn_uni_countplot_merge('NAME_YIELD_GROUP')
fn_uni_piechart_merge('NAME_YIELD_GROUP')
Observations
XNAinterest rate is the highest among all loan applicationsmiddleandhighinterest rates are the second and third highest among all loan applications
Analysis of NAME_GOODS_CATEGORY¶
df_merge['NAME_GOODS_CATEGORY'].value_counts().sort_values(ascending = False).head()
XNA 797209 Mobile 193743 Consumer Electronics 104785 Computers 89923 Audio/Video 85502 Name: NAME_GOODS_CATEGORY, dtype: int64
fn_uni_countplot_merge('NAME_GOODS_CATEGORY')
fn_uni_piechart_merge('NAME_GOODS_CATEGORY')
Observations
XNAgoods category is the highest among all loan applicationsmobilegoods categoryis the second highest among all loan applications
Univariate analysis of numerical variables¶
Analysis of AMT_APPLICATION¶
Outlier identification of AMT_APPLICATION¶
Calculating IQR (Inter Quartile range)
Q1 = df_merge['AMT_APPLICATION'].quantile(0.25)
Q3 = df_merge['AMT_APPLICATION'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
160249.5
Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))
Min value before which outlier exist: -220623.75 Max value after which outlier exist: 420374.25
Removing outliers and plotting distplot
plt.figure(figsize = [20,8])
sns.distplot(df_merge[df_merge['AMT_APPLICATION'] <= Max_value].AMT_APPLICATION, hist=True)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.show()
Observations
- Most of the loan amount applied by the clients initially seems to be very small as can be seen from the huge spike at the beginning of the distribution
Analysis of AMT_ANNUITY_y¶
Outlier identification of AMT_ANNUITY_y¶
Calculating IQR (Inter Quartile range)
Q1 = df_merge['AMT_ANNUITY_y'].quantile(0.25)
Q3 = df_merge['AMT_ANNUITY_y'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
14162.917500000001
Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))
Min value before which outlier exist: -14986.518750000001 Max value after which outlier exist: 41665.15125
Removing outliers and plotting distplot
plt.figure(figsize = [20,8])
sns.distplot(df_merge[df_merge['AMT_ANNUITY_y'] <= Max_value].AMT_ANNUITY_y, hist=True)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.show()
Observations
- Most of the previous loan's annuity from the clients is less than 10,000 as the distribution is high here
- As previous loan's annuity increases, the no. of clients decreases
Analysis of AMT_CREDIT_y¶
Outlier identification of AMT_CREDIT_y¶
Calculating IQR (Inter Quartile range)
Q1 = df_merge['AMT_CREDIT_y'].quantile(0.25)
Q3 = df_merge['AMT_CREDIT_y'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
190759.5
Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))
Min value before which outlier exist: -261258.75 Max value after which outlier exist: 501779.25
Removing outliers and plotting distplot
plt.figure(figsize = [20,8])
sns.distplot(df_merge[df_merge['AMT_CREDIT_y'] <= Max_value].AMT_CREDIT_y, hist=True)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.show()
Observations
- This distribution very closely resembles that of AMT_APPLICATION. This means that most people received the loan amount that they applied for
Analysis of AMT_GOODS_PRICE_y¶
Outlier identification of AMT_GOODS_PRICE_y¶
Calculating IQR (Inter Quartile range)
Q1 = df_merge['AMT_GOODS_PRICE_y'].quantile(0.25)
Q3 = df_merge['AMT_GOODS_PRICE_y'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
179671.5
Values before (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR) are outliers
Min_value = (Q1 - 1.5 * IQR)
Max_value = (Q3 + 1.5 * IQR)
print("Min value before which outlier exist: {}".format(Min_value))
print("Max value after which outlier exist: {}".format(Max_value))
Min value before which outlier exist: -219678.75 Max value after which outlier exist: 499007.25
Removing outliers and plotting distplot
plt.figure(figsize = [20,8])
sns.distplot(df_merge[df_merge['AMT_GOODS_PRICE_y'] <= Max_value].AMT_GOODS_PRICE_y, hist=True)
plt.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation = 45)
plt.show()
Observations
- Most of the goods price asked by clients in previous application is less than 100K
Correlation analysis of numerical variables¶
Plotting correlation matrix¶
corr_df = df_merge[['AMT_ANNUITY_x', 'AMT_APPLICATION','AMT_CREDIT_x', 'AMT_GOODS_PRICE_x',
'AMT_ANNUITY_y', 'AMT_CREDIT_y', 'AMT_GOODS_PRICE_y', 'CNT_PAYMENT']].corr()
corr_df.head()
| AMT_ANNUITY_x | AMT_APPLICATION | AMT_CREDIT_x | AMT_GOODS_PRICE_x | AMT_ANNUITY_y | AMT_CREDIT_y | AMT_GOODS_PRICE_y | CNT_PAYMENT | |
|---|---|---|---|---|---|---|---|---|
| AMT_ANNUITY_x | 1.00 | 0.11 | 0.76 | 0.76 | 0.20 | 0.11 | 0.13 | -0.01 |
| AMT_APPLICATION | 0.11 | 1.00 | 0.12 | 0.12 | 0.81 | 0.98 | 1.00 | 0.68 |
| AMT_CREDIT_x | 0.76 | 0.12 | 1.00 | 0.99 | 0.15 | 0.12 | 0.14 | 0.04 |
| AMT_GOODS_PRICE_x | 0.76 | 0.12 | 0.99 | 1.00 | 0.16 | 0.12 | 0.14 | 0.04 |
| AMT_ANNUITY_y | 0.20 | 0.81 | 0.15 | 0.16 | 1.00 | 0.82 | 0.82 | 0.40 |
Creating a HeatMap
plt.figure(figsize = (20,8))
sns.heatmap(data = corr_df, annot = True, cmap = "RdYlGn", cbar = True, fmt='.2f')
plt.show()
Observations
AMT_APPLICATIONhas a high correlation withAMT_ANNUITY_y,AMT_CREDIT_y,AMT_GOODS_PRICE_yand decent correlation withCNT_PAYMENTAMT_GOODS_PRICE_yhas a high correlation withAMT_ANNUITY_y,AMT_CREDIT_y,AMT_APPLICATIONand decent correlation withCNT_PAYMENTAMT_CREDIT_yhas a high correlation withAMT_GOODS_PRICE_yand decent correlation withCNT_PAYMENTAMT_ANNUITY_xhas a high correlation withAMT_GOODS_PRICE_y,AMT_CREDIT_yAMT_ANNUITY_xhas a high correlation withAMT_GOODS_PRICE_x,AMT_CREDIT_xAMT_CREDIT_xhas a high correlation withAMT_GOODS_PRICE_x
Bivariate/Multivariate analysis¶
Continuous V/S Continuous variables¶
Analysis of AMT_GOODS_PRICE_y V/S AMT_CREDIT_y V/S NAME_CONTRACT_STATUS¶
Outlier identification of AMT_GOODS_PRICE_y
max_value1_AMT_GOODS_PRICE_y = outlier_range(df_merge,'AMT_GOODS_PRICE_y')
max_value1_AMT_GOODS_PRICE_y
499007.25
Outlier identification of AMT_CREDIT_y
max_value1_AMT_CREDIT_y = outlier_range(df_merge,'AMT_CREDIT_y')
max_value1_AMT_CREDIT_y
501779.25
Plotting scatterplot
plt.figure(figsize = [20,8])
sns.scatterplot(x = df_merge[df_merge['AMT_GOODS_PRICE_y'] < max_value1_AMT_GOODS_PRICE_y].AMT_GOODS_PRICE_y,
y = df_merge[df_merge['AMT_CREDIT_y'] < max_value1_AMT_CREDIT_y].AMT_CREDIT_y,
data = df_merge,hue = 'NAME_CONTRACT_STATUS')
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout(pad = 4)
plt.show()
Observations
- At lower levels of previous application's Goods price < 200K and Credit > 300k, have a chance of getting refused. However, this is a weak correlation as we have less data points to support this
Analysis of AMT_ANNUITY_y V/S AMT_CREDIT_y V/S NAME_CONTRACT_STATUS¶
Outlier identification of AMT_ANNUITY_y
max_value1_AMT_ANNUITY_y = outlier_range(df_merge,'AMT_ANNUITY_y')
max_value1_AMT_ANNUITY_y
41665.15125
Outlier identification of AMT_CREDIT_y
max_value1_AMT_CREDIT_y = outlier_range(df_merge,'AMT_CREDIT_y')
max_value1_AMT_CREDIT_y
501779.25
Plotting scatterplot
plt.figure(figsize = [20,8])
sns.scatterplot(x = df_merge[df_merge['AMT_ANNUITY_y'] < max_value1_AMT_ANNUITY_y].AMT_ANNUITY_y,
y = df_merge[df_merge['AMT_CREDIT_y'] < max_value1_AMT_CREDIT_y].AMT_CREDIT_y,
data = df_merge,hue = 'NAME_CONTRACT_STATUS')
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout(pad = 4)
plt.show()
Observations
- There are lots of refusal observations with Annuity amount < 10000 and Credit amount > ~250K. This might be because higher credit amount should also require higher Annuity from Client to pay it
Analysis of AMT_APPLICATION V/S AMT_GOODS_PRICE_y V/S NAME_CONTRACT_STATUS¶
Outlier identification of AMT_APPLICATION
max_value1_AMT_APPLICATION = outlier_range(df_merge,'AMT_APPLICATION')
max_value1_AMT_APPLICATION
420374.25
Outlier identification of AMT_GOODS_PRICE_y
max_value1_AMT_GOODS_PRICE_y = outlier_range(df_merge,'AMT_GOODS_PRICE_y')
max_value1_AMT_GOODS_PRICE_y
499007.25
Plotting scatterplot
plt.figure(figsize = [20,8])
sns.scatterplot(x = df_merge[df_merge['AMT_APPLICATION'] < max_value1_AMT_APPLICATION].AMT_APPLICATION,
y = df_merge[df_merge['AMT_GOODS_PRICE_y'] < max_value1_AMT_GOODS_PRICE_y].AMT_GOODS_PRICE_y,
data = df_merge,hue = 'NAME_CONTRACT_STATUS')
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout(pad = 4)
plt.show()
Observations
- Application amount has strong positive correlation with Goods price
Analysis of AMT_APPLICATION V/S AMT_CREDIT_y V/S NAME_CONTRACT_STATUS¶
Outlier identification of AMT_APPLICATION
max_value1_AMT_APPLICATION = outlier_range(df_merge,'AMT_APPLICATION')
max_value1_AMT_APPLICATION
420374.25
Outlier identification of AMT_CREDIT_y
max_value1_AMT_CREDIT_y= outlier_range(df_merge,'AMT_CREDIT_y')
max_value1_AMT_CREDIT_y
501779.25
Plotting scatterplot
plt.figure(figsize = [20,8])
sns.scatterplot(x = df_merge[df_merge['AMT_APPLICATION'] < max_value1_AMT_APPLICATION].AMT_APPLICATION,
y = df_merge[df_merge['AMT_CREDIT_y'] < max_value1_AMT_CREDIT_y].AMT_CREDIT_y,
data = df_merge,hue = 'NAME_CONTRACT_STATUS')
plt.ticklabel_format(style='plain', axis='x')
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout(pad = 4)
plt.show()
Observations
- Application amount has strong positive correlation with Credit amount
Continuous V/S Categorical variables¶
Analysis of NAME_CONTRACT_STATUS V/S AMT_CREDIT_y V/S CODE_GENDER¶
Outlier identification of AMT_CREDIT_y
max_value_AMT_CREDIT_y = outlier_range(df_merge,'AMT_CREDIT_y')
max_value_AMT_CREDIT_y
501779.25
Client with Payment difficulties
df_merge.groupby(by = ['NAME_CONTRACT_STATUS','CODE_GENDER']).AMT_CREDIT_y.describe().head()
| count | mean | std | min | 25% | 50% | 75% | max | ||
|---|---|---|---|---|---|---|---|---|---|
| NAME_CONTRACT_STATUS | CODE_GENDER | ||||||||
| Approved | F | 598648.00 | 207549.31 | 279373.33 | 0.00 | 48078.00 | 104935.50 | 229500.00 | 4509688.50 |
| M | 287450.00 | 189288.80 | 267951.42 | 0.00 | 45430.62 | 94140.00 | 197820.00 | 4050000.00 | |
| Canceled | F | 179044.00 | 24215.18 | 160577.58 | 0.00 | 0.00 | 0.00 | 0.00 | 3847104.00 |
| M | 80397.00 | 26053.06 | 174648.76 | 0.00 | 0.00 | 0.00 | 0.00 | 3524220.00 | |
| Refused | F | 163327.00 | 376531.00 | 458660.51 | 0.00 | 66159.00 | 202500.00 | 513531.00 | 4085550.00 |
fn_bi_boxplot_merge('NAME_CONTRACT_STATUS','AMT_CREDIT_y',max_value_AMT_CREDIT_y,'CODE_GENDER')
Observations
- Clients who are
RefusedandFemaleapply for higher median credit amount thanMale
Analysis of NAME_CONTRACT_STATUS V/S AMT_ANNUITY_y V/S CODE_GENDER¶
Outlier identification of AMT_ANNUITY_y
max_value_AMT_ANNUITY_y= outlier_range(df_merge,'AMT_ANNUITY_y')
max_value_AMT_ANNUITY_y
41665.15125
Client with Payment difficulties
df_merge.groupby(by = ['NAME_CONTRACT_STATUS','CODE_GENDER']).AMT_ANNUITY_y.describe().head()
| count | mean | std | min | 25% | 50% | 75% | max | ||
|---|---|---|---|---|---|---|---|---|---|
| NAME_CONTRACT_STATUS | CODE_GENDER | ||||||||
| Approved | F | 598644.00 | 14660.85 | 13624.36 | 0.00 | 5927.98 | 10424.61 | 18314.93 | 393868.66 |
| M | 287448.00 | 14513.24 | 13996.13 | 0.00 | 5790.79 | 10010.41 | 17900.91 | 229274.95 | |
| Canceled | F | 6073.00 | 28917.04 | 17584.22 | 1324.31 | 16236.18 | 25409.74 | 38133.00 | 210950.86 |
| M | 2687.00 | 33066.86 | 19936.90 | 1780.88 | 19257.17 | 29550.60 | 43799.47 | 218713.55 | |
| Refused | F | 141148.00 | 20318.15 | 16580.62 | 0.00 | 8765.83 | 15750.00 | 27512.04 | 418058.15 |
fn_bi_boxplot_merge('NAME_CONTRACT_STATUS','AMT_ANNUITY_y',max_value_AMT_ANNUITY_y,'CODE_GENDER')
Observations
- Clients who got
CancelledandMalepaid higher median Annuity thanFemale - Clients who got
RefusedandFemalepaid higher median Annuity thanMale
Analysis of NAME_CLIENT_TYPE V/S AMT_GOODS_PRICE_y V/S NAME_CONTRACT_STATUS¶
Outlier identification of AMT_GOODS_PRICE_y
max_value_AMT_GOODS_PRICE_y= outlier_range(df_merge,'AMT_GOODS_PRICE_y')
max_value_AMT_GOODS_PRICE_y
499007.25
Client with Payment difficulties
df_merge.groupby(by = ['NAME_CLIENT_TYPE','NAME_CONTRACT_STATUS']).AMT_GOODS_PRICE_y.describe().head()
| count | mean | std | min | 25% | 50% | 75% | max | ||
|---|---|---|---|---|---|---|---|---|---|
| NAME_CLIENT_TYPE | NAME_CONTRACT_STATUS | ||||||||
| New | Approved | 242297.00 | 103479.31 | 137367.96 | 0.00 | 38106.00 | 67500.00 | 121500.00 | 5850000.00 |
| Canceled | 186.00 | 481775.18 | 576602.84 | 22455.00 | 135000.00 | 225000.00 | 675000.00 | 3150000.00 | |
| Refused | 12219.00 | 169230.61 | 280373.28 | 0.00 | 45000.00 | 97452.00 | 180000.00 | 4050000.00 | |
| Unused offer | 1867.00 | 79936.32 | 61318.33 | 0.00 | 39325.50 | 66758.85 | 104231.25 | 688059.00 | |
| Refreshed | Approved | 79838.00 | 188335.88 | 245136.33 | 0.00 | 54859.50 | 106469.89 | 219150.00 | 3150000.00 |
fn_bi_boxplot_merge('NAME_CONTRACT_STATUS','AMT_GOODS_PRICE_y',max_value_AMT_GOODS_PRICE_y,'NAME_CLIENT_TYPE')
Observations
- Clients who are
NewandCanceledhave less median goods price compared toRepeaterandRefreshed - Clients who are
ApprovedandNewhave less median goods price compared toRepeaterandRefreshed
Analysis of NAME_CONTRACT_STATUS V/S AMT_CREDIT_y V/S NAME_PORTFOLIO¶
Outlier identification of AMT_CREDIT_y
max_value_AMT_CREDIT_y= outlier_range(df_merge,'AMT_CREDIT_y')
max_value_AMT_CREDIT_y
501779.25
Client with Payment difficulties
df_merge.groupby(by = ['NAME_PORTFOLIO','NAME_CONTRACT_STATUS']).AMT_CREDIT_y.describe().head()
| count | mean | std | min | 25% | 50% | 75% | max | ||
|---|---|---|---|---|---|---|---|---|---|
| NAME_PORTFOLIO | NAME_CONTRACT_STATUS | ||||||||
| Cards | Approved | 82407.00 | 227605.36 | 215931.33 | 0.00 | 67500.00 | 157500.00 | 292500.00 | 1350000.00 |
| Canceled | 417.00 | 228733.81 | 186123.23 | 45000.00 | 112500.00 | 180000.00 | 270000.00 | 900000.00 | |
| Refused | 39159.00 | 239762.70 | 207093.85 | 0.00 | 90000.00 | 180000.00 | 270000.00 | 2250000.00 | |
| Unused offer | 1.00 | 202500.00 | NaN | 202500.00 | 202500.00 | 202500.00 | 202500.00 | 202500.00 | |
| Cars | Approved | 241.00 | 1122764.27 | 644748.22 | 202500.00 | 735520.50 | 980100.00 | 1293556.50 | 4509688.50 |
fn_bi_boxplot_merge('NAME_CONTRACT_STATUS','AMT_CREDIT_y',max_value_AMT_CREDIT_y,'NAME_PORTFOLIO')
Observations
- Clients who have
Unused offerreceive more median credit inPOSportfolio - Clients who are
Refusedreceive more median credit inCashportfolio - Clients who are
Approvedreceive more median credit inCarsportfolio
Categorical V/S Categorical variables¶
Analysis of YEARS_BIRTH_CATEGORY V/S NAME_CONTRACT_STATUS¶
fn_bi_countplot_merge('YEARS_BIRTH_CATEGORY','NAME_CONTRACT_STATUS')
Observations
- Clients who are in the age range 30-40 get most approval followed by clients in 40-50 age range
- Clients who are in the age range 60-70 receive least refusals followed by 20-30 age range
Analysis of NAME_FAMILY_STATUS V/S NAME_CONTRACT_STATUS¶
fn_bi_countplot_merge('NAME_FAMILY_STATUS','NAME_CONTRACT_STATUS')
Observations
- Clients who are
Marriedreceive the most approvals
Analysis of NAME_EDUCATION_TYPE V/S NAME_CONTRACT_STATUS¶
fn_bi_countplot_merge('NAME_EDUCATION_TYPE','NAME_CONTRACT_STATUS')
Observations
- Clients who have
Secondary/secondary specialreceive the most approvals
Analysis of NAME_CLIENT_TYPE V/S NAME_CONTRACT_STATUS¶
fn_bi_countplot_merge('NAME_CLIENT_TYPE','NAME_CONTRACT_STATUS')
Observations
- Clients who are
Repeatersreceive the most approvals followed byNew
Conclusion: Client categories to be targeted for providing loan
- Clients who are employed for more than 19 years
- Clients in the age range 30-40 and 40-50
- Clients who are Married
- Male clients with Academic degree
- Students and Businessman
- Repeater clients